ABSTRACT

A series of five papers is presented. Sherry
Rubinstein, and others, characterize the changes in teacher
certification programs and reflect on the factors propelling and
influencing the direction of those changes, such as increased emphasis
on the description and testing of the skills and knowledge of
prospective teachers, and the adoption of criterion-referenced
measures to assess teacher skills and knowledge. Katharine Vorwerk
and William Gorth present a general model for developing the formal
testing component of a certification program. The model includes: (1)
developing certification requirements; (2) deciding how to assess

[...] and using [...]
approaches to assessment for initial teacher certification.
Conceptual issues are considered in relation to test design,
assessment for entry to a teacher education program, exit
credentialing, and classroom performance assessment. Paula Nassif
reviews technical issues of teacher certification testing, focusing
on standard setting and equating, and validity and job analysis.
Scott Elliot presents current applications of job analysis
methodology to teacher certification testing. (PN)
Introduction

Early in the nineteenth century, the sole credential required of teachers of
public school children was basic proficiency in reading, writing, and
arithmetic. With the advent of mass compulsory education later in the century
came the states' interest in extending these criteria to include proficiency in
professional techniques and specific subject-matter knowledge. These three
aspects--basic skills, competence in teaching techniques, and knowledge of
subject matter to be taught--have continued through to the present as the
mainstays of teacher assessment systems.

This characterization seems to suggest and underscore a considerable
consensus about and continuity over time in the important aspects of teacher
evaluation--although some would rather interpret this status as a reflection
of the slow growth in our understanding of the elements of effective teaching
and how to test for their presence. Naysayers notwithstanding, the last
decade has been marked by dramatic change in approaches to credentialing
public school teachers. The change has been not so much in the primary
domains of competence subjected to scrutiny, but in the degree of emphasis
accorded them, the manner in which they are characterized, and the manner in
which they are assessed.

The nature of the change in credentialing practice is evidenced by the
significant nationwide increase in efforts to reexamine and modify those
state-level programs charged with the responsibility of licensing teachers.
Licensure is the "process by which an agency of the government grants
permission to an individual to engage in a given occupation upon finding that
the applicant has attained the minimal degree of competency required to ensure
that the public health, safety, and welfare will be reasonably well protected"
(U.S. Department of Health, Education, and Welfare, 1977, p. 4). An individual
without a teaching license from a particular state is legally barred from the
practice of public school teaching in that state. The closely related process
of certification grants the use of a title (e.g., "teacher") to an individual
who has met a predetermined set of standards or qualifications set by a
credentialing agency (Shimberg, 1981). This distinction between licensure and
certification having been made for the record, the commonly used, generic
referent "teacher certification program" hereinafter will denote individual
state government policies and procedures regarding the granting of teacher
licenses.

Prior to the late 1960s, most states credentialed prospective teachers on
the basis of successful completion of a teacher education program of study.
Only some states went so far as to require accreditation or "approval" of such
programs, and only some states took the additional measure of requiring
entrants into the teaching field to pass a nationally standardized,
norm-referenced test. Such state policies had been stable for a considerable
length of time, which suggested a prevailing opinion that certification
programs were fulfilling their purpose. From the lack of controversy, one could
conclude that most groups and individuals concerned with public education were
satisfied that these programs were adequate to ensure that unqualified
individuals were excluded from teaching and that all qualified applicants had
fair and unbiased access to the profession.



The decade of the 1970s stands in marked contrast. During this time,
teacher certification programs were taken to task by a variety of interest
groups concerned with the quality of teaching in the nation's schools, and
state departments of education faced strong and often contradictory demands
for change. As a result, teacher certification programs were subjected to
considerable scrutiny and underwent extensive changes. The purpose of this
paper is to characterize these changes--particularly those related to tests
and measures--and to reflect on the factors that perhaps propelled and
certainly influenced the direction of those changes. In doing so, the authors
will first call upon empirical evidence to document the existence and extent
of the change observed and argue that the significant features of the change
are (a) new and different emphases in the description and testing of the
skills and knowledge which prospective teachers should possess; and
(b) increasing adoption of criterion-referenced measures to assess the skills
and knowledge so described. These changes will then be analyzed in terms of
their relationship to events and factors in three separate spheres: (1) the
general political environment, (2) the legal/regulatory environment, and
(3) the educational/research environment. In summary, the authors will
conclude that the changes are having--or will have--a variety of positive
effects.



Evidence of Change 



Substantiating claims of change in teacher certification programs is not a
difficult task. That change was "in the air" was evident and was publicized as
early as 1975 when a study by Pittman (1975) revealed that between 1970 and
1975 every state in the Union had considered the idea of modifying teacher
certification practices to incorporate the then-new principles of
competency-based education. This spate of activity took a variety of forms,
including the appointment of study panels, the commissioning of position
papers, the hosting of conferences, and the review of concrete proposals.
These activities at a minimum suggested an interest in re-analyzing teacher
certification requirements and, in a significant number of cases, this
interest was followed by action. A number of states made significant
modifications to their existing certification programs; others chose to design
totally new programs to replace existing ones. Changes were variously brought
to bear on the policies and practices of all four phases of teacher
certification programs, those effective: (1) upon admission to teacher
training programs; (2) upon completion of such a program (initial
certification); (3) during the first year of incumbency in a teaching
position; and (4) during later incumbency (certification renewal).

One major form of revision affected the common policy that automatically
granted certification to a graduate of any teacher education program. During
the period of 1970 to 1975, 26 states revised such a policy and implemented a
system of "approving" teacher education programs (Pittman, 1975). By far the
most dramatic action (or at least the most publicly visible one), however, was
to require that graduates of teacher education programs pass a state-sponsored
test to obtain a license to teach. Between 1977 and 1981, 16 states enacted
legislation or state board of education policy that either changed or initiated
tests whose purpose was state licensing of teachers. Table 1 presents a list
of the 16 states making substantive changes in one or more testing components
of their certification programs between 1977 and 1981 and describes the
discrete elements of these programs within all four phases of the certification
program. Six of these states have since implemented the changes, while in
seven others the changes become effective this year; in the other three states
refinement and planning are in progress for implementation in the next few
years.

TABLE 1

Cross-State Matrix of Program Elements as of January 1982

The program elements depicted in Table 1 represent increased rigor in the
entire teacher certification process. The move toward more widespread adoption
of the approved-program model reflected the imposition of more stringent
requirements in an effort to upgrade programs and to improve the quality of the
professionals they graduated. All of the 16 states represented now have an
approved-program requirement. Noteworthy activity is occurring in at least
two other states. The New Jersey State Board of Education is considering
imposing more stringent requirements on the curricula of teachers' colleges,
and Connecticut is involved in related deliberations on ways to improve teacher
education.

Nature of Testing-related Changes



More significant for present purposes are those requirements that involve
changes in testing practices: (a) testing of prospective program entrants,
and (b) testing of program graduates as eligible and prospective license
holders. An example of the former is Alabama's newly installed English Language
Proficiency Test, which assesses basic skills in reading, writing, language
skills, and listening. It is the installation of tests such as this one that
reveals a heightened emphasis on "the basics" in the screening of prospective
teachers. This trend is mirrored in end-of-program testing. An increasing
number of states are including a basic skills test as one component of initial
certification requirements; Florida's new program is a prime example.

That in more and more states graduates must pass a state-mandated test
over and above fulfilling all other course and program requirements is itself
evidence of increased stringency in certification programs. This evidence is
less compelling, however, than the changing character of the tests being
used. As Table 1 indicates, the most common tests in use are a nationally
standardized norm-referenced test (the National Teacher Examination--the NTE),
a locally validated norm-referenced test (the NTE subjected to a within-state
validation process), and a customized criterion-referenced test (CRT). It is
only in the last several years that the latter two have come into common use
for end-of-program testing. This trend has been concomitant with increasing
specificity in the description of the skills and knowledge which entering
teachers should possess, specificity characteristic of objective-referenced
assessment.

Another significant feature of the change in initial certification testing
is an increased emphasis on content-oriented tests. While some states have
traditionally used the NTE Specialty Area Examinations, more and more states
are funding the development of criterion-referenced tests in these and other
areas. South Carolina's recent legislation, for example, called for customized
development of CRTs in eight teaching areas not covered by the NTE (including
Trades and Industries, Distributive Education, German, Latin, Earth Science,
Psychology, Speech and Drama, and Health). Georgia now has a total of 18
teaching field CRTs assessing prospective teachers' knowledge of the content
to be taught to students in a variety of subject areas (including Agriculture,
Music, Early Childhood, Middle Childhood, Communicative Arts, Business, Home
Economics, Industrial Arts, French, and Spanish), with a nineteenth field
(Health) currently under development. Oklahoma's new program being installed
this year is by far the most extensively CRT based, with 62 separate teaching
area tests, including Journalism, Driver and Safety Education, and umbrella and
subarea exams in Science (e.g., Zoology), Social Studies (e.g., Economics,
Oklahoma History), Business Education (e.g., Accounting, Shorthand), and
Language Arts (e.g., World Literature).

Even the foregoing recitation, however, underplays the range of content
areas being assessed by CRTs. Special education is also receiving considerable
attention. South Carolina has four separate special education area
exams, Georgia has three, and Oklahoma has seven (counting the umbrella
exam). There are also tests for other pupil personnel service positions.
Oklahoma has seven: Psychologist, School Counselor, Speech Pathology,
Psychometrist, Reading Specialist, Audiovisual Specialist, and Librarian.
South Carolina has one (Speech Correction) and Georgia has two (Library Media
and School Counselor) with three others currently under development. (There
are also CRT certification exams for Administrators -- Georgia's
Administration and Supervision test and Oklahoma's three separate tests for
superintendents, elementary and secondary school principals, respectively.)

The development and installation of these tests are strong indications of
the increasing emphasis on content area (subject-matter) tests and the
increasing adoption of criterion-referenced approaches to measurement. Other
changes have come hand-in-hand with these. The developmental process for
teacher certification tests has been increasingly characterized by a strong
validation effort. Examples are the local validation process to which the
standardized NTE is being subjected in some states (see "VNTE" states on Table
1) and the full-scale job analyses which, as an early step in the development
of CRTs, serve to identify the knowledge and skills viewed by job incumbents
(teachers of the specific subject matter) as frequently used and important in
their work (e.g., in Georgia and Oklahoma).

There is little doubt that recent developments in the nature and types of
tests in use represent a significant change in teacher certification policy.
These trends, however, did not develop in a vacuum. They have their sources
in, or at least were influenced by, factors in three other areas: the general
political environment, the legal/regulatory environment, and the education/
measurement environment. Each of these areas is analyzed in the following
sections in an attempt to sort out and identify factors postulated to bear a
relationship to the changing nature of teacher certification programs.

The General Political Environment 

The concept of "political environment" is here intended to denote the set
of factors which, when taken together, constitute the sociopsychological and
socioeconomic fabric of our collective lives. Thus, we distinguish from all
other factors those which appear to be out of the purview or control of any
single individual, group, agency, or institution. The indicators of the
general political environment are readily perceived. In the past
decade, one of the most obvious was an alarmingly pervasive dissatisfaction
with the outcomes of public education. This dissatisfaction, voiced and also
fueled by the national media, included educators' frustration with a ten-year
decline in SAT scores, parents' reports of functionally illiterate high school
graduates, and business leaders' complaints about the lack of even minimally
qualified entrants into the work force.

In the early 1970s, parents and other critics alike began demanding a
"return to basics" as a means of assuring the accountability of local school
systems. "Accountability" itself became a byword, if not a bona fide
movement, and it targeted all tangible features and products of the schools.
First, the spotlight was turned to students themselves; public pressure led
legislatures and state departments of education, through the 1970s, to
institute minimum competency test programs. These programs, while diverse in
design, had the common purpose of reflecting the school systems' success or
failure at teaching certain predefined "basics" to each and every student.
These programs imposed consequences on students for failure to perform at or
above "minimally acceptable" levels.

An equally harsh light was cast on school curricula, including a look not
only at the traditional "Three Rs," but at social studies, science, and a host
of other subjects. Public pressure was exerted to increase the utility of
what was taught to students, a continuation of the demand for relevance heard
earlier in the 1960s. In response, educators began modifying curricula in
form and/or substance, to focus on skills and knowledge useful to students in
their economic, political, and social lives. The emphasis moved from what
students should know to what students should be able to do, the latter being
more observable and therefore more productive of answers to questions of
accountability.



Throughout the decade, the mass media and popular press devoted substantial
coverage to the "crisis in education" and the ability of the system to educate
the nation's youth. It should have come as no surprise, then, that the focus
broadened from an examination of the curriculum and student outcomes to include
an appraisal of the agents of instruction: teachers themselves. From books
such as Morris Kline's Why Johnny Can't Add (Kline, 1973) to a New York Times
editorial (Montgomery, 1979) to a cover story for Time Magazine (Help! Teacher
Can't Teach, 1980), the competence and ability of those who teach came under
increasing attack. The public demanded assurances that teachers were qualified
to do their jobs--to such an extent that it was estimated that the teacher
testing movement, the most visible of all certification-related activities,
was supported by 85% of U.S. adults (Foote, 1980).

It is noteworthy that the 1970s were characterized by these demands for
accountability. The underlying factor might be isolated as a common
preoccupation with economic pressures. The decade was beset by rapid inflation
and diminishing resources, which turned the public's attention
away from perceived "luxuries" in education and spawned this new "back to
basics" movement. It could be argued that, as an extension of concern about
personal budgetary constraints, the consumers of education were asking (and
continue to ask) what value they were getting for their education tax dollars.
In the face of strong countervailing efforts by teachers' unions to "protect"
incumbent teachers, the states' response to these consumer demands focused
heavily on credentialing of prospective teachers. In many cases, the response
was the very visible one of expanding and strengthening the initial
certification testing components of their programs.



The Legal/Regulatory Environment

As public pressure was brought to bear on teacher certification programs,
a number of legal and regulatory precedents were being set which influenced
the direction of the movement. These were an outgrowth of Title VII of the
Civil Rights Act and the Equal Employment Opportunity Commission (EEOC)
Guidelines on Employee Selection Procedures. Additionally, there was the
influence exerted by development of the 1974 version of the Standards for
Educational and Psychological Tests (APA, AERA, NCME, 1974). The promulgation
of these regulations and standards reflected increasing legislative, judicial,
and professional concern with fair employment practices both in and out of
education.

Legislation, regulations, and the Courts. Stated simply, Title VII of the
Civil Rights Act of 1964 outlawed employment discrimination on the basis of
sex, race, color, religion, or national origin and empowered the EEOC to
enforce the stipulations of the law. The 1970 EEOC Guidelines, a revision of
the first version published in 1966, included a set of stipulations founded on
the premise that standardization and proper validation in employee selection
procedures would build a foundation for the nondiscriminatory personnel
practices required by Title VII. These stipulations (EEOC, 1970) included the
following:



(a) empirical data should be made available to establish the predictive
validity of a test, that is, the significant correlation of test
performance with job-relevant work behaviors; such data must be
collected according to generally accepted procedures for establishing
criterion-related validity;

(b) where predictive validity is not feasible, evidence of content
validity (in the case of job knowledge or proficiency tests) may
suffice as long as appropriate information relating test content to job
requirements is supplied;

(c) where validity cannot otherwise be established, evidence of a test's
validity can be claimed on the basis of validation in other
organizations as long as the jobs are shown to be comparable and there are
no major differences in context or sample composition;

(d) differential failure rates (with consequent adverse effects on
hiring) for members of groups protected by Title VII constitute
discrimination unless the test has proven valid (as defined above) and
alternative procedures for selection are not available; and

(e) differential failure rates must have a job-relevant basis and, where
possible, data on such rates must be reported separately for minority
and nonminority groups.
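
The first and last of these stipulations describe quantities that are simple to compute. The following is a minimal sketch in Python, not a procedure taken from the Guidelines: it assumes predictive validity is summarized by a Pearson correlation between test scores and some job-performance criterion, and that failure rates are tallied per group against a single cut score. The function names, the cut score, and all data are hypothetical.

```python
from math import sqrt


def pearson_r(test_scores, criterion_scores):
    """Pearson correlation between test scores and a job-performance criterion.

    Assumption: this is one conventional index of predictive validity;
    the source does not prescribe a particular statistic.
    """
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion_scores))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in test_scores))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in criterion_scores))
    return cov / (spread_x * spread_y)


def failure_rates_by_group(records, cut_score):
    """Failure rate per group, reported separately.

    records: iterable of (group_label, score) pairs (hypothetical layout).
    A candidate fails when the score is below cut_score.
    """
    totals = {}
    failures = {}
    for group, score in records:
        totals[group] = totals.get(group, 0) + 1
        if score < cut_score:
            failures[group] = failures.get(group, 0) + 1
    return {group: failures.get(group, 0) / totals[group] for group in totals}


if __name__ == "__main__":
    # Entirely made-up data for illustration.
    tests = [55, 62, 70, 78, 85]
    ratings = [2.1, 2.8, 3.0, 3.9, 4.2]
    print("validity coefficient:", round(pearson_r(tests, ratings), 3))

    applicants = [("minority", 58), ("minority", 71),
                  ("nonminority", 64), ("nonminority", 80)]
    print("failure rates:", failure_rates_by_group(applicants, cut_score=60))
```

A difference between the per-group rates is only the starting point of the analysis the stipulations call for; whether that difference is defensible depends on the validity evidence.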
As a result of Title VII and the EEOC Guidelines, many concepts which had
previously been the purview of psychometricians took on important legal
ramifications. In the first major challenge to employment tests (Griggs v. Duke
Power Company, 1971), the Supreme Court unanimously interpreted Title VII as
prohibiting "not only overt discrimination but also practices that are fair in
form, but discriminatory in operation" (p. 431). This decision decreed that
absence of intent to discriminate was insufficient to justify the use of a
test which had a disproportionate impact on protected minorities; even the
employer with the best of intentions bore the responsibility of demonstrating
"that any given requirement . . . (bears) a manifest relationship to the
employment in question" (p. 431). The Court further commented that the tenets
of the Guidelines were "entitled to great deference" (p. 434) because they
were drafted by the enforcing agency for Title VII. It was in this way that
the concept of "job relatedness" came to be incorporated into the law of
employment testing (Bersoff, 1981) and virtually came to have the effect of
law (Rebell, 1976).

Two other early cases are worthy of note. In Chance v. Board of Examiners
(1972), the New York licensing exams for principals and other administrators
were declared invalid for lack of job relevance. Later, in Albemarle Paper
Company v. Moody (1975), the Court invoked the EEOC Guidelines and, in effect,
established criteria to be used in proving whether employers' tests were job
related. Specifically, the Court made reference to the importance of analyzing
"the attributes of, or the particular skills needed in," (p. 432) a given job
as a basis for creating a job-relevant test.






Most significant for teacher certification programs was passage of a
1972 amendment (Public Law 92-261) to the Civil Rights Act which struck out
the exemption for educational personnel in public institutions, extending the
provisions of EEOC beyond private industry to state and local government
agencies. Prior to the amendment, court challenges against public employers
(e.g., Chance v. Board of Examiners) were initially brought on equal
protection grounds under the Fourteenth Amendment, which required only that
employers demonstrate a rational basis for use of a test. Arguments only
indirectly cited, but amassed consensual support for, EEOC Guidelines which were
not technically binding at the time (Rebell, 1976). The 1972 Amendment paved
the way for later litigation (e.g., United States v. State of North Carolina,
1975) which successfully challenged the NTE as a teacher selection test. For
an excellent review of these cases and an overview of the law and teacher
certification, see Licensing and Accreditation in Education: The Law and the
State Interest (Levitov, 1976).

Throughout the decade, the concepts contained in the 1970 EEOC Guidelines
were refined through the process of litigation and resulting Court opinion.
Concurrently, various federal agencies were debating related issues, a debate
which culminated in publication of the 1978 Uniform Guidelines (EEOC, CSC,
Department of Labor, and Department of Justice, 1978), a document which
contained "specific statements in most sections, in contrast to the more general
statements of the 1970 Guidelines" (Novick, 1981, p. 1040). The intent was
made clear: a test must be a representative measure of the actual domain
of skills used on the job and must be validated for its intended purpose.
Professional standards. A discussion of the regulatory environment
affecting teacher certification testing cannot exclude the process whereby
professionals and practitioners regulate themselves. An example of this
self-regulation is reflected in the publication of the Standards for
Educational and Psychological Tests (APA, AERA, NCME, 1974). Unlike earlier
documents of its kind which stressed the obligations of test producers, the
1974 Standards addressed competency in testing practice and test use (Novick,
1981). Novick (1981) presents an excellent review of the evolution in
professional standards over the last three-quarters of a century, but most
revealing is his comment that this first document on test use "might not have
happened, had it not been for the emergence of the social questions to which
the EEOC Guidelines clearly responded, and the concomitant civil rights
pressure of numerous advocacy groups" (p. 1043).

The Standards display many similarities to the EEOC Guidelines and, in
fact, both the 1974 document and its 1966 precursor were cited in numerous
court cases (e.g., Albemarle) to bolster the credibility and importance of the
Guidelines themselves (Bersoff, 1981). Beyond the emphasis on validation
strategies, however, the Standards stressed the requirement to investigate
potential bias in the measures and to report results for separate subsamples
(i.e., minority groups). Further, the Standards specified that any pass-fail
scores used should be accompanied by "a rationale, justification, or
explanation" (p. 66) for their adoption. It was provisions such as these which
were taken seriously by the designers and implementers of the newer teacher
certification programs.
The combined impact. Taken together, Title VII, the EEOC Guidelines,
resulting court challenges, and the Standards can be seen as catalysts and
guides to the restructuring of teacher certification programs. Their impact
is evidenced in several aspects of these programs:



(a) Because it has not been feasible to conduct predictive validity
studies (based primarily on difficulties in obtaining reliable and
valid measures of the criterion), the response has been to more fully
incorporate other validation efforts. Increased attention is being
paid to the validity of certification tests, and it is focused almost
exclusively on content validity.

(b) The focus on content validity has greatly expanded the involvement of
incumbent teachers and subject-matter specialists in the test
development process, both through committee review work and participation
in full-scale job analyses. It is through these methods that the
test development process attends to the specific attributes of a job
and provides evidence of the test's relevance to the job to which it
applies.

(c) There is increased awareness of the potential for differential
impact, with expanded efforts to include diverse interest groups in
the test development process and to report test results separately
for relevant minority groups.

(d) Finally, there has been a shift toward the use of
criterion-referenced, as opposed to norm-referenced, models of standard
setting; a variety of methods incorporating expert judgments about
the test items themselves are coming into more popular use.


These trends reflect the significant impact of the legal/regulatory
environment on the design of teacher certification programs.
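
To make the idea of standard setting from expert judgments about test items concrete, here is a minimal sketch in Python. It assumes an Angoff-style procedure, which is one well-known judgmental method and is not named in the text above: each judge estimates, item by item, the probability that a minimally qualified candidate would answer correctly, and the cut score is the sum over items of the judges' mean estimates. The function name and the ratings are hypothetical.

```python
def judgmental_cut_score(ratings):
    """Angoff-style cut score from expert item judgments (illustrative only).

    ratings[j][i] is judge j's estimate of the probability that a
    minimally qualified candidate answers item i correctly.
    Returns the expected raw score of such a candidate, which serves
    as the recommended passing score.
    """
    n_judges = len(ratings)
    n_items = len(ratings[0])
    if any(len(judge) != n_items for judge in ratings):
        raise ValueError("every judge must rate every item")
    # Average the judges' estimates for each item, then sum across items.
    item_means = [sum(judge[i] for judge in ratings) / n_judges
                  for i in range(n_items)]
    return sum(item_means)


if __name__ == "__main__":
    # Two hypothetical judges rating a three-item test.
    panel = [
        [0.5, 0.75, 1.0],
        [0.5, 0.25, 1.0],
    ]
    print("recommended cut score:", judgmental_cut_score(panel))
```

The resulting standard depends only on test content and the judges' view of minimal competence, not on how any group of examinees happened to perform, which is what distinguishes it from a norm-referenced standard.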

The Education/Measurement Environment

It is in this final context, the education/measurement environment, that
discussion focuses on factors within the purview of educators and
psychometricians, rather than on factors external to the domain of education.
Two distinct themes are to be examined: (a) theory development in relation to
teacher education practice, specifically the growth of competency-based
teacher education (CBTE), and (b) advances in measurement theory and
statistical techniques relevant to criterion-referenced tests.

CBTE. The early 1970s saw the start of the CBTE movement, a newly
conceived pedagogy for teacher education programs based initially on the
already-established concept of mastery learning. Among 14 defining and
ancillary features of CBTE, Hall & Houston (1981) included six which bear at
least a surface relationship to the characteristics of the newer teacher
certification programs:

(a) instruction focused on learner outcomes rather than on time in
attendance;

(b) a priori description of the intended learner outcomes;

(c) introduction of subcompetency and competency statements;

(d) emphasis on mastery, at least to some minimum level of identified
learning;

(e) de-emphasis on how well a student performs relative to other students
in favor of emphasis on demonstration of desired outcomes; and

(f) clear and public communication of minimum levels of success with
continual feedback on performance.

Even in its early days, there were optimistic predictions that CBTE would
result in "new measures of teacher behavior" and "new criteria for
certification" (Hall & Houston, 1981, p. 20). It was the basic tenet that
instruction be objective based which was most influential. In the spread of
CBTE to teacher training institutions, the pedagogy was rarely fully understood
or fully adopted, but even where it was only superficially incorporated into an
ongoing program, it included a focus on establishing objectives for learning.
The debate surrounding CBTE therefore included in-depth examination and
discussion of which skills and competencies teachers needed to develop. One
chief product of this debate was the development of performance-based
standards against which teacher competency could be judged. It was thus that
CBTE provided the testing movement with the criteria necessary to develop
clear, valid, job-relevant certification tests.

Tests, measures, and statistics. As CBTE provided the criteria to be
measured (or at least fashioned the willingness to do so), it devolved to the
measurement community to respond with appropriate tools and instruments. It
became clear that the existing standardized norm-referenced tests could not
fulfill the demand for content validity and tailored job relevance, for
specification of objectives, or for scoring in comparison to preset criteria
rather than in terms of group norms. Thus, the rapid growth in the demand for
and use of criterion-referenced tests went hand-in-hand with the CBTE movement.
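
The contrast between the two scoring approaches can be illustrated with a minimal sketch. The scores, the cut score of 70, and the function names below are hypothetical and are not drawn from any program discussed in this paper. The sketch shows only that a criterion-referenced decision depends on a preset standard, while a norm-referenced result depends on who else happens to be in the group.

```python
from bisect import bisect_left


def criterion_referenced_result(score, cut_score):
    """Pass/fail decision against a preset criterion (hypothetical cut score)."""
    return "pass" if score >= cut_score else "fail"


def percentile_rank(score, group_scores):
    """Norm-referenced result: percent of the group scoring strictly below `score`."""
    ordered = sorted(group_scores)
    below = bisect_left(ordered, score)
    return 100.0 * below / len(ordered)


# Hypothetical data: the same raw score of 72 placed in two different norm groups.
weak_group = [50, 55, 60, 65, 72]
strong_group = [72, 80, 85, 90, 95]
CUT_SCORE = 70  # assumed preset standard, for illustration only

for group in (weak_group, strong_group):
    print(criterion_referenced_result(72, CUT_SCORE), percentile_rank(72, group))
```

Under these assumed data the criterion-referenced decision is the same in both groups, while the percentile rank of the identical raw score changes with the composition of the group.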

While it is beyond the scope of this paper to provide technical details,
it is clear that the growth of CBTE and CRT supported, and in turn continued
to be supported by, research and development of new measurement
techniques. We have witnessed refinements in methods of defining domains
(Popham, 1980) and generating statements of learning objectives (Popham,
1978), strategies for developing test items (Hambleton & Eignor, in press),
and methods of setting cut scores (Nassif, 1978; Hambleton, 1980). There have
also been significant advances in CRT-relevant statistics, including indices
of reliability (Subkoviak, 1980), application of latent trait models (Cook,
Eignor & Hutten, 1979), new approaches to item analysis (Berk, 1980), and new
methods of investigating test item bias (Merz & Grossen, 1979). These
technical developments went a long way toward enabling increased rigor in
criterion-referenced testing conducted for public policy reasons. And, given
[text illegible in source], the need for a stringent,
fair, and legally defensible system for certifying teachers fueled support for
continuing technical refinements.

Summary and Conclusions

Early in this paper, the authors suggested a "bandwagon" effect in the
increasing adoption of CRT-based teacher certification programs. In doing so,
the intent was not to suggest automatically that "the band is playing the
right tune," although the many CRT supporters in the professional community
would like to think so. Yet it can be argued that the recent trend toward
increasing rigor in the teacher certification process is associated with a
variety of positive effects:

(a) The visible nature of the change has increased the involvement of
    educators and special interest groups in debate over what teachers
    should know. This debate helps to fend off potential complacency
    that might thwart growth in our knowledge base about the constitutive
    elements of effective teaching.

(b) The movement has substantially increased communication about what the
    tests measure, a trend which served to enhance the meaningfulness of
    test scores. This may be contrasted with the traditional scoring of
    NRTs, which diverted attention away from test content in favor of
    person-to-group comparisons.

(c) The objectives-based construction of the tests enables test takers to
    learn, in advance, the expectations set for them, a condition which
    most recent research suggests contributes to maximizing performance.

(d) Completing the communication cycle, the newer certification programs
    entail expanded feedback to examinees on their performance, including
    indications of strengths and weaknesses with regard to specified
    domains on the tests.

(e) The objectives-based approach has also increased the utility of
    feedback to institutions about the performance of their graduates. The
    optimists among us (Hall & Houston, 1981) anticipate that, once the
    competency tests are installed, "teacher education programs will
    start preparing their students to a sufficient level of mastery of
    each test criterion" (p. 25). In essence, this would constitute the
    upgrading of teacher education programs that was the initial
    intention of CBTE.

(f) The new legal implications have heightened the focus on incorporating
    the most state-of-the-art techniques in the measurement of the
    competencies of prospective teachers. The increased attention to
    technical rigor can only serve to further protect the test takers.

(g) Lastly, the visibility of these developments has turned a spotlight
    on the importance of the role of the public school teacher in
    American society. The controversies surrounding teacher certification
    testing have increased the outreach efforts of state departments
    of education to explain (or justify) their policies and
    practices. These efforts have, at a minimum, increased information
    sharing and the public's awareness of state efforts to fulfill
    accountability demands.

Notwithstanding these positive effects, there are several implications of
the testing movement which deserve serious study. The first is a concern
about the immediate teacher supply. With more stringent criteria for
certification, fewer prospective teachers are likely to receive licenses, and
school systems are likely to find it increasingly difficult to staff certain
positions. Even under the hopeful assumption that, in response, teaching
institutions will upgrade the skills of their graduates, there is little doubt
that a significant time lapse will exist. In the meantime, state departments
of education are likely to experience substantial pressure to implement
politically expedient solutions to this problem.

Second, the reporting of test results for examinees on an
institution-by-institution basis has already begun to engender political
pressure to "reward or punish" institutions on the basis of their
"performance." Where failure rates are excessive, for example, threats of loss
of accreditation are not likely to be uncommon. In the face of these
pressures, it will be increasingly difficult to ward off simplistic solutions
to complex problems.

Third and finally, the differential passing rates being observed for
minority groups have direct implications for the proportion of minority group
teachers in the nation's schools. While the testing programs and the tests
themselves may be held to be valid on the basis of evidence they present, the
implications of their use must be considered from the larger cultural and
sociological perspective. Issues such as these are raised as caveats to the
testing practitioner, in the interest of emphasizing that testing for public
policy purposes must be conceived and implemented in a manner that is both
professionally and socially responsible.



References 



Albemarle Paper Co. v. Moody, 95 S. Ct. 2362 (1975).

American Psychological Association, American Educational Research
    Association, and National Council on Measurement in Education.
    Standards for educational and psychological tests and manuals.
    Washington, D.C.: American Psychological Association, 1974.

Berk, R. Criterion-referenced measurement: The state of the art. Baltimore:
    Johns Hopkins University Press, 1980.

Bersoff, D. Testing and the law. American Psychologist, 1981, 36, 1047-1056.

Chance v. Board of Examiners, F. Supp. 203 (S.D.N.Y., 1971), aff'd 458 F.2d
    1167 (2d Cir., 1972).

Cook, L. L., Eignor, D. R., and Hutten, L. R. Considerations in the
    application of latent trait theory to objectives-based
    criterion-referenced tests. Laboratory of psychometric and evaluative
    research report. Amherst, MA: University of Massachusetts, School of
    Education, 1979.

Equal Employment Opportunity Commission, Civil Service Commission, Department
    of Labor, & Department of Justice. Adoption by four agencies of uniform
    guidelines on employee selection procedures. Federal Register, 1978, 43,
    38290-38315.

Griggs v. Duke Power Company, 401 U.S. 424 (1971).

Hall, G., Houston, R. Competency-based teacher education: Where is it now?
    New York University Quarterly, 1981, [volume illegible in source], 20-28.

Hambleton, R. K. Test score validity and standard-setting methods.
    In R. D. Berk (Ed.), Criterion-referenced measurement: The state of the
    art. Baltimore, MD: Johns Hopkins University Press, 1980.






Hambleton, R., Eignor, D. A practitioner's guide to criterion-referenced
    test development, validation, and usage. Laboratory of psychometric and
    evaluative research report No. 70. Amherst, MA: University of
    Massachusetts, School of Education, 1979 (2nd ed.).

Help! Teacher can't teach! Time Magazine, June 16, 1980.

Kline, M. Why Johnny can't read. New York: St. Martin's Press, 1973.

Levitov, B. Licensing and accreditation in education: The law and the state
    interest. Lincoln, Nebraska: University of Nebraska, 1976.

Merz, W. R., Grossen, N. E. An empirical investigation of [illegible in
    source] examining test item bias (Final report, grant number illegible in
    source). Sacramento, CA: Foundation of California University,
    Sacramento, CA, 1979.

Montgomery, J. Can "teach" teach? New York Times, May 14, 1979.

Nassif, P. M. Standard-setting for criterion-referenced teacher licensing
    tests. Paper presented at the annual meeting of the National Council on
    Measurement in Education, Toronto, March, 1978.

Novick, M. Federal guidelines and professional standards. American
    Psychologist, 1981, 36, 1036-1047.

Pittman, J. Actions taken by state departments of education in developing
    CBTE certification systems. Paper delivered at the Association of
    Teacher Educators Annual Conference, New Orleans, February, 1975.

Popham, J. Criterion-referenced measurement. Englewood Cliffs, New Jersey:
    Prentice-Hall, 1978.

Rebell, M. The law, the courts, and teacher credentialing reform. In
    B. Levitov (Ed.), Licensing and accreditation in education: The law and
    the state interest. Lincoln, Nebraska, 1976.






Shimberg, B. Testing for licensing and certification. American Psychologist,
    1981, 36, 1138-1146.

Subkoviak, M. J. Decision-consistency approaches. In R. D. Berk (Ed.),
    Criterion-referenced measurement: The state of the art. Baltimore, MD:
    Johns Hopkins University Press, 1980.

U.S. Department of Health, Education, and Welfare, Public Health Services.
    Credentialing health manpower (DHEW Publication No. (OS) 77-50057).
    Washington, D.C.: Author, July, 1977.

United States Equal Employment Opportunity Commission. Guidelines on employee
    selection procedures. Washington, D.C.: Aug. 24, 1966 (29 C.F.R. 1607).

United States v. State of North Carolina, Civil No. 4476 (E.D.N.C., 1975).



COMMON THEMES IN TEACHER CERTIFICATION TESTING
PROGRAM DEVELOPMENT AND IMPLEMENTATION



Katherine E. Vorwerk
William Phillip Gorth



National Evaluation Systems, Inc. 
30 Gatehouse Road 
Amherst, MA 01004 



A Paper Presented at the Annual Meeting 
of the National Council on Measurement 
in Education, New York, 1982 






Overview

Programs of teacher competency training and teacher competency testing
prior to certification are not new. Yet there still exists confusion about
the intended purpose and outcomes of such programs, particularly teacher
certification testing programs. For example, it has been said that teacher
certification testing programs:

• will either improve the quality of education or lower the teaching
  profession's standards because of their emphasis on minimal
  knowledge;

• will either serve to define what a good teacher is or end up being
  nothing more than a "search for victims" and a "hollow means of
  judging the efficacy of teachers" (Cole, 1979); and

• will either test for content that is unrelated to successful
  teaching or test for content that is an absolute necessity.

As is true of all occupational licensing laws, the primary purpose of
teacher certification laws and their testing component is to "protect the
public health, safety, and welfare" by ensuring that only individuals who are
competent in a subject are allowed to teach it. Yes, certification testing
programs do in most cases emphasize minimum content knowledge; yes, they can
result in improvement in the quality of education; yes, they may end up being
part of a definition of what a good teacher is and what content knowledge is
absolutely necessary, etc. But these are secondary outcomes of such programs.
The primary outcome, which every program is designed to achieve, is the
protection of the public from incompetence.

The public is clearly concerned about teacher competence. For example,
in a recent Gallup Poll, 95% of those polled agreed that teachers should be
required to pass exams in their subject areas (Cole, 1979). Teacher
incompetence is frequently used by parents and legislators as a partial
explanation for the decline in students' test scores that we have witnessed
over the past 15 years. Moreover, the large number of states that require or
soon will require a teacher certification testing program (approximately 15),
or are considering doing so, is further testimony to the fact that the public
wants its children protected from incompetent teachers.

A systematically developed teacher certification program can potentially
prevent individuals who lack competence in critical subjects from entering the
teaching profession. This paper presents a general model for developing the
testing component of a certification program. The model's structure will be
described and the key issues associated with each component of the model will
be presented.

It should be pointed out that the model applies only to the formal testing
component of a teacher certification program, such as a structured observation
session or a paper-and-pencil content test. It does not apply to other parts
of a certification program such as course requirements or student teaching.

Development Model

The model consists of five components (see Figure 1), which are:

(1) Developing Certification Requirements;

(2) Deciding How to Assess Requirements;

(3) Defining Measurement Strategies and Instruments;

(4) Handling Logistical Issues of Assessment; and

(5) Communicating and Using Assessment Results.

Figure 1

TEACHER CERTIFICATION TESTING
PROCESS MODEL

The components are roughly sequential, although several of the steps overlap.
Each component will be discussed separately in the sections that follow.

Developing Certification Requirements. Requirements mandating teacher
certification testing generally come either from state boards of education or
state legislatures. For example, the authority for Alabama's testing program
comes from the State Board of Education, while that for Florida comes from the
legislature. Occasionally, the effort may be a joint one. After requirements
are developed, they are generally passed on to the state's department of
education for further definition and implementation. Ideally, the department
of education provided input during the development of the mandated
requirements, and therefore is at least familiar with their content.

Other constituencies that should be represented in developing
certification requirements are the state's teacher training institutions,
teachers, and the general public. Each of these constituencies will be
affected by the requirements. Their input during the initial stages of
defining the requirements will help to ensure that the requirements are both
workable and acceptable.

At this stage some states contact other states' departments of education
for advice and background on their certification programs, or they engage the
services of one or more testing consultants who also can provide information
about existing certification testing programs as well as psychometric
consultation.

Deciding How to Assess Requirements. Generally, the state's department
of education is responsible for deciding how the requirements will be
assessed. Requirements, of course, vary widely. Some merely specify the use
of a particular test or series of tests; e.g., Arkansas's regulations "require
persons applying for initial certification to satisfactorily complete 'an
existing teachers' examination' or other similar examination." In such cases,
the state department moves on to the next component of the model. In other
states the requirements specify a more comprehensive and detailed testing
program that could include entrance exams to teacher education programs, exit
exams covering basic skills and specific content knowledge, and other
nonexamination requirements (e.g., practice teaching and in-service training
for certified teachers and administrators).

Deciding how to assess requirements heavily influences the measurement
strategies that will be used as well as the type of results that will be
produced. For example, deciding to assess teaching skills using some type of
on-the-job observation procedure will result in the implementation of a very
different type of measurement instrument than if a decision is made to assess
the content knowledge of teachers in the subject or subjects they
aspire to teach.

In reaching a decision, several key issues must be considered. First,
budgetary constraints must be realistically evaluated. Assessment strategies
will vary in cost. The cost will be paid partly by the state for start-up
costs and partly by the individual examinee for operating costs. Second, time
considerations are critical. Often mandated requirements include an
implementation date. The development of a certification testing program
tailored to the state's curriculum requirements involves more time than
adopting an existing test. If a testing program must be produced within three
to five months, this will have an impact on the type of assessment selected.
Third and finally, the question of which assessment strategy will best protect
the state's public must be asked. For example, should systematic observation
of teachers on the job and paper-and-pencil tests of content knowledge both be
used, or is one or the other sufficient? Clearly, some combination of
observation, paper-and-pencil tests, and preservice evaluation is preferable,
but given time and budgetary constraints, is this possible? If not, which
approach is actually going to meet the needs of the state in the most
satisfactory manner?

Defining Measurement Strategies and Instruments. This component is
closely tied to the previous one. Deciding how to assess requirements is, in
effect, defining measurement strategies. However, the creation of assessment
instruments involves additional technical work.

This component is a major one, and it generally consumes most of the
start-up resources expended on the testing program. In this phase the actual
testing instruments are developed. Professional and legal guidelines for
tests used for certification purposes apply here and must be clearly
understood and followed. Major issues covered by these guidelines include the
need for job relatedness of the measurement instruments, test validity, test
reliability, and the passing score or standard that is used. Specific
technical considerations involved in each of these will be discussed in a
later symposium paper (Nassif & Elliot).

In addition to adherence to professional and legal guidelines for test
development and use, the department of education must take care to involve
members of the teaching profession--both actual teachers and teacher
educators--in the development of assessment instruments. The involvement of
these individuals is critical to ensure the appropriateness of the test
instrument to the state. Clearly, teaching professionals would be involved in
a job analysis procedure carried out to establish the job relatedness of the
content of a particular examination. For example, Oklahoma surveyed over
4,500 teachers as part of the process of establishing the job relatedness of
Oklahoma's certification tests.

In addition, teachers and teacher educators also should be involved in
all other steps in this component of the model, particularly if a customized
paper-and-pencil test or other measurement instrument is being developed. For
example, committees of teachers and teacher educators should be formed to
review the domain of knowledge/skills to be included on the test, to review
the results of the job analysis procedure (and to make judgments on how those
results are to be used), and finally, to review the actual test items appearing
on the test. Such reviews by teaching professionals are typical of the
testing programs in states such as Georgia and Alabama.

Handling Logistical Issues of Assessment. Registering candidates for the
assessment and actually carrying out the assessment, the two major parts of
this component, are logistically demanding. Depending upon the program, this
component can vary from two or three administrations of one test at three
sites distributed across a state to a sequence of preservice and in-service
evaluations coordinated with a test of content knowledge given several times a
year.

It is important at this point to provide teacher certification applicants
with complete information about the testing program requirements,
administration procedures, and results reporting. This notification seems
best accomplished through a detailed registration bulletin which clearly
specifies the state's certification laws and regulations and explains to the
applicant his or her responsibilities and rights during the testing program.
The bulletins should be widely distributed in the state through the teacher
education institutions and the department's certification office. Other
information also may be available. For example, if a test is developed from a
set of content objectives, these content objectives should be made available
to students, through libraries or teacher education programs, so that they can
use them to prepare for the test.

Registration procedures should be as simple as possible to avoid mistakes
and confusion. Information about registration procedures, deadlines, fees,
and testing locations should be in the registration bulletin.

Regardless of how the registration materials are provided to examinees,
it is important that they be provided in advance of the administration so that
students have ample time to peruse them, send in registration forms, and still
have time to change their registration if they choose.

Test administrations should be standardized and secure so that all
applicants have the same opportunity to perform. Also, administrations should
occur several times during a year so that applicants have ample opportunity to
sit for the exams, and they should be spaced properly so that the results of
one administration are reported to the candidate before the registration
deadline for the next administration. In Oklahoma, for example, testing
sessions are held four times per year, and students may take up to eight tests

Communicating and Using Assessment Results. At a minimum, examination
results should be reported to four constituents. First, results should be
reported to individual examinees. Clearly, those who take the test should
find out whether they passed. But in addition to information on whether they
passed or failed the test, examinees also should be provided with diagnostic
information; i.e., score reports should include information on how the student
performed on each of the major content areas covered by the exam. This
diagnostic profile of the student's strengths and weaknesses can serve as a
springboard for additional growth. For example, a test taken in health and
physical education might provide feedback to the student on how he or she
performed on questions related to elementary physical education, physical
development, and mental health.
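
A diagnostic profile of this kind can be sketched in a few lines. The item-to-area assignments, the examinee's responses, and the function name `diagnostic_profile` below are hypothetical illustrations, not a description of any state's actual score report; only the three content-area names come from the example above.

```python
from collections import defaultdict


def diagnostic_profile(item_areas, responses):
    """Return percent correct per content area.

    item_areas: dict mapping item id -> content area name
    responses:  dict mapping item id -> True if answered correctly
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, area in item_areas.items():
        total[area] += 1
        if responses.get(item, False):  # unanswered items count as incorrect
            correct[area] += 1
    return {area: 100.0 * correct[area] / total[area] for area in total}


# Hypothetical six-item test using the content areas named in the text.
item_areas = {
    1: "elementary physical education", 2: "elementary physical education",
    3: "physical development", 4: "physical development",
    5: "mental health", 6: "mental health",
}
responses = {1: True, 2: True, 3: True, 4: False, 5: False, 6: False}

profile = diagnostic_profile(item_areas, responses)
for area, pct in profile.items():
    print(f"{area}: {pct:.0f}% correct")
```

A real score report would rest on many more items per area; with only a handful of items, area scores are too unstable to support strong conclusions about an examinee's strengths and weaknesses.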

Second, results should be reported to the colleges and universities at
which the certification applicants received their educations. These results
can provide institutions with two types of useful information: (a) how each
of their students performed on the test, and (b) the performance of each of
their individual teacher training programs. Content where students show
consistent strength or weakness may indicate corresponding areas of strength
and weakness in the training programs themselves. Such information can
stimulate curriculum modification and the strengthening of the training
programs.

Third, the state should receive results. Obviously, the state needs this
information about individuals to determine whether certification should be
granted or denied. Statewide data also provide information about how the
total group of students has performed and allow for comparison among subgroups
within the sample, for example, males vs. females.
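
As a minimal sketch of such a subgroup comparison, the following computes the proportion passing in each group. The records, the cut score of 70, and the function name `pass_rates_by_group` are hypothetical and serve only to illustrate the calculation.

```python
def pass_rates_by_group(records, cut_score):
    """records: list of (group_label, score); returns {group: proportion passing}."""
    counts = {}
    for group, score in records:
        passed, total = counts.get(group, (0, 0))
        # A score at or above the cut score counts as a pass.
        counts[group] = (passed + (score >= cut_score), total + 1)
    return {g: p / t for g, (p, t) in counts.items()}


# Hypothetical statewide records, for illustration only.
records = [
    ("female", 82), ("female", 68), ("female", 75), ("female", 91),
    ("male", 71), ("male", 64), ("male", 88), ("male", 59),
]
rates = pass_rates_by_group(records, cut_score=70)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} passing")
```

The same tabulation applies to any grouping variable the state records, such as the institution attended.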



Fourth and finally, results should be reported to the public. Results
demonstrate to the public that only teachers found to be competent in those
areas determined to be necessary have been certified, and that their children
are being protected from incompetent applicants to the teaching profession.

That is the model. As indicated at the beginning of this paper, there is
time only to cover the model in a very general way. It has not been possible
to discuss many of the important details involved in the development and
implementation of a teacher certification testing program. However, it has
been possible to at least mention most of the important issues that need to be
considered.

Conclusion

Teacher certification testing programs do not solve all of the problems
of American education, or even the smaller cluster of problems related
specifically to teacher competency. However, these programs are able to
identify people who can and cannot demonstrate, in a relevant testing
situation, the competencies which the state feels they should be able to
demonstrate. The model presented in this paper provides a general description
of the steps in the development and implementation of such a teacher
certification testing program, and the issues that must be considered.

Introduction

Since the mid-70s, concerns about the quality of teaching in public
schools have led to significant changes in the requirements for teacher
certification in programs throughout the country. An increasing number of
states and school districts are looking at various methods of assessment which
may help to improve the efficacy of the certification process (Harris, 1981;
Nothern, 1980). Since 1978, five states have substantially renovated their
programs to incorporate competency-based, criterion-referenced tests and
performance assessment for evaluating teachers seeking initial certification
(Note 1). Dozens of other states have begun the process of exploring options
and implementing similar changes; still others require candidates for initial
certification to pass some component(s) of the National Teacher Examinations
(NTE).

A cry no less vocal than the call for teacher testing is the protest that
no examination can adequately measure the skills essential to competent
teaching (NEA, 1982). This perspective seems to posit that most or all teacher
competencies are intangibles--words that begin with capital letters such as
Patience and Enthusiasm. While it seems fairly apparent that no test of
multiple-choice questions can suffice as the sole criterion for certification,
it is also apparent that some form of content-based assessment is essential to
ensure that candidates at least know the information they are supposed to
impart in the classroom. Whether or not they can impart it successfully is
the subject of later assessment through different procedures. The American
Federation of Teachers, for one example, supports the use of tests to assess
the qualifications of candidates for certification, but not for decisions
related to retention, salary, and tenure (Note 2).

Thus, the goal should not be to eliminate assessment and leave teacher
training institutes on their own to maintain standards but to support the
effort by improving the tests and other assessment methods available to
evaluate teacher candidates for certification. This perspective naturally
raises some significant conceptual issues which must be carefully considered.

Conceptual Issues

The first major issue to consider is when to assess teacher candidates.
Recently developed programs seem to indicate agreement that prospective
teachers should be assessed at at least two of three different stages (Note 3):
qualifications for admission to a teacher training program, qualifications
achieved upon completion of the program, and performance in the classroom.
A comprehensive program for initial certification would provide assessment of
teacher candidates at all three of these stages.

The second major issue is how to assess teacher candidates at each
stage. On this issue alternatives abound, but agreement founders.

Assuming that assessment occurs at the three points mentioned, the third
major issue to consider is how to conduct the assessment in a manner that is
technically and legally defensible. According to federal employment
guidelines, which also affect certification procedures, any instrument used for
licensing or selection must be a representative measure of the actual domain
of skills used on the job. It must also be able to be validated for its
actual or intended purpose (EEOC, 1978). In addition, state and local laws
which apply specifically to certain programs or aspects of teacher
certification must be heeded judiciously. In many cases state legislation or a
board of education has provided the impetus for developing and implementing a
teacher certification program. For example, state laws requiring competency
tests have been passed in Florida, Oklahoma, South Carolina, and Texas; board
of education mandates have been established in Alabama, Georgia, and New York.

The purpose of this paper is to explore various approaches to assessment
for initial teacher certification. Conceptual issues and the relative merits
of each approach currently available are considered in relation to test design,
assessment for entry to a teacher education program, exit credentialing, and
classroom performance assessment.

Assessment Design 

Certification Areas 


To a large extent, the first step in designing assessment instruments
depends upon the structure of the state's certification program, i.e., the
definition of certification areas. While one state may certify a teacher only
in a general area called Social Studies, for example, another may certify
teachers according to specialty: History, Political Science, Economics, and
so on. The definition of these areas will influence the number and type of
assessment instruments required. The first state would require only one
general content-based test for the Social Studies certificate; the second
would have to develop an umbrella test for Social Studies and/or a discrete
test for each of 6-8 specialty areas. The major reason for this is to ensure
that each candidate is responsible only for content essential to his or her
field; e.g., a person who would be certified only to teach Economics should
not be required to pass a test that includes U.S. History and Geography.

The definition of tests measuring a specific array of certification areas
usually precedes, but often depends on, determining what to measure within
each test. One important fact to keep in mind: tests should be developed or
adapted to certification areas, not the other way around, in order to maintain
the integrity of the state's own program design.

Domain Definition

Determining what to test for admission, for initial certification, and for
classroom performance assessment involves defining domains of knowledge and
skills for each assessment area. Assessing qualifications for admission to a
teacher education program may involve an evaluation of the student's academic
records or a test of basic skills, literacy, and communication. Exit
requirements may involve another evaluation of the candidate's credentials, a
test of content knowledge in a chosen teaching field, a test of pedagogy, or
alternative assessments of various performance skills. Evaluating performance
in the classroom may involve any of a large number of methods.

In the process of designing a comprehensive assessment program, the task
of determining what to assess must precede or occur at the same time as
choosing assessment methods. Basically, there are two ways of determining what
to assess.

One method is to identify the knowledge and skills taught at the college
level. For example, a candidate for teacher certification could be tested on
his or her knowledge of the curriculum required by the teacher education
program. This can be a legally defensible method (Note 4), according to the
notions of "curriculum" and "instructional" validity, and it seems fair to the
candidates: they are tested only on what they have been taught in teacher
training. However, it may not be fair to students in the classroom because
this approach assumes that colleges instruct teachers in what they need to know
in order to teach. What teachers actually have to know in order to teach in
the classroom may differ from what the colleges have prepared them to teach.

A second method, job analysis, solves this problem and lays the foundation
for establishing that the test measures a representative sample of knowledge
and skills required on the job, in accord with federal guidelines. In teacher
certification programs, job analysis has been used successfully in several
states, including Georgia.

Essentially, a job analysis (conducted by survey, observation, and/or
interviews) generates empirical data describing what people do in their jobs,
thereby identifying the qualifications needed of a candidate who wants to be
certified for that kind of job. In one approach to job analysis for teachers,
skills and content knowledge are defined by behavioral objectives, which are
rated by job incumbents (practicing teachers) as to their job relatedness (time
spent teaching or utilizing the content of the objective and its essentiality).

From the results of the ratings, the objectives can be rank ordered by
these dimensions across the overall list and can be ordered within "subareas"
used to group the objectives. When selecting objectives for assessment, it is
important to select the most job-related objectives in each subarea. This
ensures that the selected objectives reflect the proportional size of each
subarea in relation to the size of the total job-related field. In turn, this
proportionality design provides an initial estimate of a blueprint or
structure for the assessment instrument(s), which can be developed to reflect
the relative importance of each subarea containing job-related objectives.
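
The ranking and selection steps described above can be illustrated with a
minimal sketch. Everything specific in it is an assumption made for
illustration and is not taken from any state's program: the objective
identifiers, the subarea names, the 1-5 rating scale, the use of a simple
average to combine time spent and essentiality, and the rule of keeping the
top half of the objectives in each subarea.

```python
from collections import defaultdict
from math import ceil

# Hypothetical mean ratings from practicing teachers on a 1-5 scale:
# (objective id, subarea, time spent, essentiality)
ratings = [
    ("A1", "Subarea A", 4.0, 5.0),
    ("A2", "Subarea A", 2.0, 3.0),
    ("A3", "Subarea A", 5.0, 5.0),
    ("A4", "Subarea A", 1.0, 2.0),
    ("B1", "Subarea B", 3.0, 4.0),
    ("B2", "Subarea B", 2.0, 2.0),
]


def job_relatedness(time_spent, essentiality):
    """Combine the two rating dimensions (assumption: a simple average)."""
    return (time_spent + essentiality) / 2


def select_objectives(ratings, keep_fraction=0.5):
    """Keep the most job-related objectives within each subarea."""
    by_subarea = defaultdict(list)
    for obj_id, subarea, time_spent, essentiality in ratings:
        score = job_relatedness(time_spent, essentiality)
        by_subarea[subarea].append((score, obj_id))
    selected = {}
    for subarea, scored in by_subarea.items():
        scored.sort(reverse=True)  # highest job relatedness first
        n_keep = max(1, ceil(len(scored) * keep_fraction))
        selected[subarea] = [obj_id for _, obj_id in scored[:n_keep]]
    return selected


def blueprint(selected):
    """Share of the test per subarea, proportional to objectives selected."""
    total = sum(len(ids) for ids in selected.values())
    return {subarea: len(ids) / total for subarea, ids in selected.items()}


selected = select_objectives(ratings)
weights = blueprint(selected)
```

Because the same fraction is kept in every subarea, a subarea with twice as
many rated objectives receives roughly twice the weight in the resulting
blueprint, which is one way to read the proportionality design described
above.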

Using a job analysis to define assessment domains provides an empirical
basis for developing the instruments. However, a certification program in a
given state should meet additional concerns: as the NEA (1982) points out,
teachers must and should have considerable involvement in the assessment
process. Among other roles, constituency groups can help to identify emerging
fields, which teachers may not teach now but may have to next year or the year
after (e.g., metrics, the use of calculators); they can ensure that the
assessment instruments serve the intended focus of education within the state;
and they can ensure that the language and structure of the content are
appropriate to the region (e.g., one state might teach the theory of
evolution, while another requires a different approach).

Empirical information from a job analysis and expert judgments from
teachers and other constituencies can provide the foundation for determining
what to assess. The next step is to determine how to assess the competencies
identified as essential for entry, initial certification, and classroom
performance.

Assessment Methods

Entry Tests

Assessment of qualifications for admission to a teacher education program
usually occurs in the candidate's second year of college study, prior to entry
into the program at the beginning of the third year. Methods of
assessing such qualifications have most often included teacher recommendations
of the student candidate and an examination of the student's academic record
(grades, course requirements, etc.). However, recently developed programs in
South Carolina and Alabama have abolished this essentially pro forma approach;
instead, they require statewide entry tests to ensure that candidates have the
basic skills (e.g., mathematics, communication skills, general education)
required for some degree of success in the teacher education program. Another
possibility, recommended by Watts (1980), is to establish a professional
standards board for admission that functions independently of training
institutions (as some other professions have done, e.g., engineering,
architecture).

Of these three approaches, the most efficient means of assessing entry
qualifications appears to be some form of entry test. Whether the
qualifications are identified as general education (i.e., liberal arts) or
literacy and basic skills, the entry test may involve assessment methods other
than strictly multiple-choice, paper-and-pencil tests. The key is to decide
what to test, then how to test it most effectively and efficiently. If entry
qualifications include literacy skills of reading, writing, and listening, for
example, then assessment methods must be capable of measuring the skills
required.

South Carolina has begun developing a basic skills entry test of reading,
writing, and mathematics, the content of which was identified through an
extensive survey. Alabama has already developed and implemented an entry exam
called the English Language Proficiency Test, which was administered for the
first time in November, 1981. Its content, derived from a validation survey
of practicing teachers in all fields, includes reading, writing, language
skills, and listening. Methods of assessing these areas are listed in the
chart that follows:

Alabama's English Language Proficiency Test

Content Area       Assessment Method

Reading            A "cloze" test of reading comprehension, using
                   multiple-choice items with 5 choices

Writing            An essay test, scored by the holistic method

Language Skills    A multiple-choice test (4 choices per item) of basic
                   grammar, mechanics, and reference skills

Listening          A listening tape of passages read aloud, testing
                   comprehension by multiple-choice items

In addition to these two state-developed programs, nationwide standardized
tests are also available. One possibility is the "Common" examination
portion of the NTE, which currently includes sections on Professional
Education and General Education (social studies, literature and fine arts,
science, etc.). According to a recent announcement by Educational Testing
Service (Note 5), the "Common" portion of the exam will be revamped. The new
version will essentially include all of the present components, plus a new
section on Communication Skills (listening, reading, writing).

Exit Tests

The next stage in the process is the assessment of qualifications for
initially certifying a teacher who has completed a teacher training program.
Once the qualifications have been specified, by job analysis or other method,
a number of options exist for assessing these qualifications. Past
certification procedures have been based largely on the candidate's completion
of an accredited teacher preparation program (Hathaway, 1980). But this
approach necessarily assumes adequate standards of competency enforced by each
program and some relative comparability across programs in a given state. With
the increasing concern for the actual competency level of teachers, fewer
states are willing to assume the quality of teacher preparation programs; more
and more states have implemented, or will implement, teacher competency tests
for initial certification.

In most areas the essential competencies are content based, thus
measurable by paper-and-pencil tests. For these areas, states have the option
of selecting and adopting an existing standardized test if it meets their
needs; developing their own state-specific tests; or achieving some
combination of the two approaches.

Basic or professional skills tests. Exit tests are often used to assess
basic skills or professional skills which are common to all teaching areas.
For example, Tennessee has used the California Achievement Test as an exit test
of basic skills for all teacher candidates; many states have used one or more
portions of the NTE's Common Examination to measure general education and/or
professional skills. In some states (e.g., Washington), colleges and
universities develop and require their own tests of general education.



Three states, Florida, Alabama, and Arizona, have developed their own
instruments. Florida's Teacher Certification Examination, based on a list of
23 generic competencies, measures reading, writing, mathematics, and
professional education. Assessment methods include a multiple-choice cloze
test, an essay test scored holistically, and tests of multiple-choice items.
Alabama requires that all teacher certification candidates pass a
multiple-choice test of basic professional studies, which is based on a job
analysis of practicing teachers in all fields. In addition to this test,
candidates in Alabama must pass content knowledge tests specific to their
teaching areas. In Arizona's program, currently under development, all
teachers will have to pass a multiple-choice test of generic teaching
knowledge and skills.

One important consideration in choosing assessment methods for
professional education or pedagogy tests required of teacher candidates in all
fields is to distinguish between content knowledge, which is measurable by
paper-and-pencil tests, and classroom skills, which are not directly measurable
by this technique. Professional skills required in the classroom should be
measured during the performance assessment stage of the certification process.

Teaching field tests. Exit tests are also used to measure teaching field
content knowledge. Here again, states have several options for selecting or
developing tests for this purpose. Choosing the NTE, which provides tests of
some 26 specific areas, has its advantages and disadvantages. On the positive
side, the NTE is relatively inexpensive (compared to the cost of developing new
tests). Also, it can be adopted and implemented in a relatively short time, an
important factor if state law or mandate requires rapid implementation. On the
negative side, adoption of the NTE can pose some potential problems. First,
it has a preestablished set of teaching area tests available; thus, a state
must adapt its certification areas to the test and adopt some other method of
certifying in areas not covered. Second, to conform to legal requirements, the
NTE must usually be validated within the state where it will be used (a process
which can take up to three years). That is, the content of the tests must be
compared and analyzed empirically in relation to state teacher preparation
curriculum (as has occurred in South Carolina and will occur in Virginia) or to
the results of a teacher job analysis. Thus, adoption is not as
straightforward and unencumbered as it may seem. Third, the NTE provides
norm-referenced scores comparing a student's performance to the performance of
others or, in some modified programs, to standard scores determined within a
state. The student receives a pass or fail and a numerical score but (unless
special modifications are made) no indication of strengths and weaknesses,
which could be extremely helpful both to the institutions and to the students
who must retake the test(s).

If adoption of available tests does not satisfy a program's requirements,
then a second alternative is to develop new tests. In the past four years,
several states have developed their own content area tests for initial teacher
certification to meet their own specific needs. Georgia began in 1975 and has
since implemented tests in 23 different areas, all of them based on extensive
job analysis and teacher involvement. Tests in eight more fields are currently
under development. Alabama has developed tests in 31 areas, which were
administered for the first time in December, 1981. Oklahoma has developed tests
covering 79 different areas: 26 general tests for individual teaching fields;
8 umbrella tests for such fields as Social Studies, Mathematics, and Language
Arts; and 45 specific area tests, which must be taken along with the
appropriate umbrella exam(s).



The advantages of full-scale developmental efforts are manifold: the
criterion-referenced tests are based on job analyses; the tests match the
state's certification areas; they can be empirically content validated; and
test scores provide indications of relative strengths and weaknesses on
specific domains within each test. In addition, teachers and administrators
within the state participate in the development process, thereby ensuring the
relevance of the tests and helping to instill grassroots support for the
testing program. As to disadvantages, the first is cost: large-scale
development projects can be expensive. The second, which applies only in some
cases, is the time required for development. If done properly, programs of
such magnitude and complexity require anywhere from one and a half to four
years for development, which may be a disadvantage if a mandate has limited
the time available.

But what if neither of these alternatives, adoption or development, is
suitable? Some states have combined certain aspects of each alternative to
create specially tailored certification programs. For example, South Carolina
is currently in the process of developing teaching area tests in ten specific
fields. Other certification areas offered in South Carolina require the
candidate to take specified portions of the NTE. However, since a state law
requires the reporting of strengths and weaknesses for all teacher tests, the
NTE's normal scoring method must be altered to suit South Carolina's
requirements.

Special concerns. While these procedures for selecting or developing
paper-and-pencil tests may seem relatively straightforward, a number of special
concerns will arise during the process. The first, mentioned earlier, is the
design of certification areas: general tests, special area tests, and so on.
In a field such as special education, this can be a critical and volatile
issue. The second, also related to test design, is the need to provide subtest
or domain scores. This provision requires careful design to ensure adequate
measurement of knowledge and skills not just within the test as a whole, but
also within each subtest or domain. The third special concern, related to
assessment methods, deserves more detailed exploration at this point:
multiple-choice, paper-and-pencil tests may not adequately cover the
representative domain of content knowledge in some teaching fields. While some
special fields may be "low incidence" (i.e., only a few people certified
annually), and thus delegated to local assessment programs, they must be
considered first at the state level. Several examples here may be helpful.

The content knowledge required of a prospective teacher in music or a
foreign language may only be partially covered by a paper-and-pencil test.
Music teachers must also be able to listen to and recognize musical selections,
proper articulation, misplayed notes, and so on. For this reason, both Georgia
and Oklahoma have developed listening tests in music: examinees listen to the
tapes, then answer multiple-choice questions. Similarly, a foreign language
teacher must be able to speak and to understand the language; thus, tests such
as the NTE and those developed in Alabama, Georgia, Oklahoma, and South
Carolina include language-tape tests. However, in most cases, speaking tests
occur at the local rather than state level.

In vocational areas, on-the-job performance is often essential to the
teacher candidate's preparation. The area commonly called "Trades and
Industries (T & I)," for example, includes trades as diverse as cosmetician,
tailor, and diesel mechanic. Most states require a T & I teacher to be
licensed and experienced within the trade he or she would teach; they also may
require some amount of teacher training. Tests in the T & I field, as in
South Carolina, for example, may include both a paper-and-pencil test of the
generic skills taught in teacher training to all T & I candidates and an
actual or simulated performance test of trade skills (conducted by the
colleges themselves).

In summary, these kinds of special concerns will undoubtedly arise in the
development of a certification program. Efforts to accommodate these concerns
must consider the use of alternative assessment methods to measure skills which
cannot be assessed adequately by strictly multiple-choice, paper-and-pencil
tests; at the same time, they must consider the cost and practicality of
alternative methods (Priestley, 1982).

Classroom Performance Assessment

The third and final stage of initial teacher certification is the
assessment of classroom performance, usually conducted during the period in
which the candidate holds a temporary or "provisional" license or certificate.
The basic goal of performance assessment is twofold: (1) to help the teacher
improve his or her skills, and (2) to collect information on which to base an
administrative decision as to whether or not the candidate should receive full
or "permanent" certification. Scriven (1981) distinguishes between the
assessment methods appropriate to these goals by identifying the requirements
and benefits of formative and summative evaluation.

Achieving these goals demands that assessment of performance be limited
to competencies that teachers would be expected to possess as entry-level
professionals, and that the assessment methods provide fair, reliable measures
of competencies determined to be essential. The first demand is a function of
defining the domain of essential competencies, a process that may be based on
teacher training curriculum or on job analysis (as stated earlier in relation
to content-based tests). The second demand, for adequate assessment methods,
requires a broader perspective.

As MacDonald (1973) reported, the state of the art of performance
assessment technology was a "rather depressing picture" in 1973. Since then,
however, considerable progress has been made as the demand for more effective
methods has become more clamorous and persistent. Unlike the first two
stages (entry and exit tests), assessment at this third stage does not include
the option of standardized, off-the-shelf instruments for performance
assessment. On the other hand, the methods available are numerous, and there
are programs to consider as potential models for the development of performance
assessment procedures. Most important at this stage is the development of an
assessment that meets the specific needs of a state or local program, at the
level on which actual evaluations will occur.

In terms of methods for assessing teacher performance, Medley (1978)
constructively proposes six general alternatives, and Haefele (1980) critically
reviews twelve (with considerable overlap among the alternatives presented).
Millman (1981) examines a number of methods in depth, with relation to their
use in teacher evaluation, and many of these methods can be adapted for use in
assessment for initial certification.

Simply classified, the methods of assessment involve three basic types:
observational ratings of the teacher in action (by students, peers,
supervisors, principals, independent evaluators); training/simulation
exercises; and testing of the teacher's classroom students (e.g., before and
after instruction). Within each of these categories are a number of specific
techniques, but some are more useful than others in measuring performance on
specified competencies. For example, as Medley (1978) points out, Popham's
(1975) suggested approach of the "teaching test" and related approaches
involving pre- and post-tests of the classroom students can really only yield
overall means and test scores. While these kinds of data might be useful, they
cannot be matched directly to specified teacher competencies. Assessment of
particular skills identified as essential to adequate teacher performance
requires the use of methods that can measure and provide feedback on each
skill, for both formative and summative needs.

Programs designed to accomplish these purposes have been developed in
South Carolina and Georgia. In the Georgia program, in addition to meeting
the requirements of course credits and grades, and passing a
criterion-referenced teaching area test, the teacher candidate undergoes
performance assessment during the first year while holding a provisional
certificate.

Georgia's Teacher Performance Assessment Instruments (Note 6) were
developed to measure performance in relation to specific teaching skills
identified through an extensive survey as both generic and essential to
teaching in all fields. Assessment is governed and provided by five different
instruments, as described below:

Instrument and Method of Assessment

• Teaching Plans and Materials Instrument (TPM)
  A portfolio of instructional preparation rated by data collectors who also
  interview the teacher

• Classroom Procedures Instrument (CP)
  Direct classroom observation of teaching methods and practices

• Interpersonal Skills Instrument (IS)
  Direct classroom observation of the teacher's ability to create a sociable
  atmosphere and manage classroom interactions

• Professional Standards Instrument (PS)
  Interviews with the teacher, his or her colleagues, and supervisor to gather
  information on professional conduct (complying with policies and procedures,
  participating in professional growth activities, etc.)

• Student Perceptions Instrument (SP)
  A questionnaire filled out by students, composed of items parallel to those
  in the CP and IS instruments

For each of the first four instruments, at least three trained data collectors
(peers, supervisors, principals, independent evaluators, et al.) rate the
teacher's performance on each indicator on the basis of a 5-point scale. Mean
scores across all raters and all indicators are calculated by computer or by
hand.
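
The averaging step can be illustrated with a short sketch. The rater labels,
indicator names, and scores below are hypothetical and are not drawn from the
Georgia instruments; the only features taken from the text are the 5-point
scale, the use of at least three raters, and the calculation of means across
raters and indicators.

```python
from statistics import mean

# Hypothetical ratings on a 5-point scale: rater -> indicator -> score
ratings = {
    "rater_1": {"ind_1": 4, "ind_2": 3, "ind_3": 5, "ind_4": 4},
    "rater_2": {"ind_1": 5, "ind_2": 3, "ind_3": 4, "ind_4": 4},
    "rater_3": {"ind_1": 3, "ind_2": 3, "ind_3": 3, "ind_4": 4},
}


def check_scale(ratings, low=1, high=5):
    """Reject any score that falls outside the rating scale."""
    for rater, scores in ratings.items():
        for indicator, score in scores.items():
            if not low <= score <= high:
                raise ValueError(f"{rater}/{indicator}: {score} is off the scale")


def indicator_means(ratings):
    """Mean score on each indicator, taken across all raters."""
    indicators = sorted({ind for scores in ratings.values() for ind in scores})
    return {ind: mean(scores[ind] for scores in ratings.values())
            for ind in indicators}


def overall_mean(ratings):
    """Mean score across all raters and all indicators."""
    return mean(s for scores in ratings.values() for s in scores.values())


check_scale(ratings)
per_indicator = indicator_means(ratings)
overall = overall_mean(ratings)
```

Keeping the per-indicator means alongside the overall mean preserves the
skill-level feedback that the text identifies as necessary for formative use.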

It is important to note here that only the first three instruments are
used for summative certification decisions; the other two (the student
perceptions and professional standards instruments) are used formatively to
determine the need for in-service training and to create teacher performance
profiles.

Conclusion 

This paper, in relation to several conceptual issues, has explored a
number of assessment options for initial teacher certification. A basic tenet
stated at the outset is that assessment should occur at three stages: before
admission to a teacher training program, upon completion of the program, and
during on-the-job performance in the classroom. Certification should be based
on at least these three assessments and not on any one of them as the sole
criterion.

Regarding the assessments themselves, the content or domain of what to
assess should be defined carefully, preferably through job analysis and with
extensive teacher involvement. Assessment instruments should then be designed
and either selected or developed to measure the specified domains as
effectively and efficiently as possible. Above all, given that states' needs,
teacher training programs, and qualifications for different teaching fields
vary considerably, all assessment methods should be fitted to the specific
needs of a given situation. No one all-encompassing solution is possible for
assessing competence in a profession of such importance, variation, and
frequent change.



Introduction 



Teacher certification testing programs present challenges to the
practitioner regarding several technical issues. Parts I and II of this paper
will focus on standard setting and equating, and validity and job analysis,
respectively. A review of these technical issues requires a delineation of
the methods currently in use. In each section that follows, present approaches
will be described and discussed. Recommendations for alternatives will be
suggested where appropriate.

Standard Setting



Clearly one of the most significant aspects of tests developed and used
for employment decisions is setting the passing score or cut score. This area
of research is a broad field of its own, replete with legal factors, technical
concerns, and logistical considerations. There are several models available
for standard setting. Koffler (1980) and Hambleton & Eignor (1978), among
others, have studied various methods and examined their appropriateness,
accuracy, and usefulness. Many methods of standard setting have been used
frequently in student competency assessment. These methods include:
Nedelsky (1954), Angoff (1971), Ebel (1972), Jaeger (1978), Contrasting Groups
and Borderline Groups (Zieky & Livingston, 1977).

However, when one reviews the methods actually used in setting cut scores
for teacher certification testing, one finds a smaller list than the one above.

In past years, state-mandated use of the National Teacher Examinations
(NTE) often involved the administration of the exam and the establishment
of a passing score by state administrative decision. The procedure was not
empirical, nor did it result in a cut score that bore a systematic
relationship to successful performance on the job. In some states the use of
the NTE with an arbitrarily set cut score was legally challenged. In the
following cases the continued use of the exam in this manner was not allowed
by law: United States v. North Carolina (1975), Baker v. Columbus Municipal
Separate School District (1976), and Georgia Association of Educators v. Jack
P. Nix (1976). In a case that involves the use of a cutoff score to determine
those candidates that are qualified or unqualified, the user of the test must
give sufficient proof that the cutoff was not established in a capricious or
arbitrary manner.

In South Carolina in 1977, it was found that the use of the NTE resulted
in adverse impact against blacks. However, the state decided to investigate
the test, validate it in South Carolina, and set cut scores in a systematic,
empirical fashion. The result was that some of the NTE tests were validated
and approved for use in South Carolina. This situation in South Carolina is a
blend of using an "off-the-shelf" test and a test customized for state use.
This is discussed further.

Approaches

Underlying most methods used are the procedures designed by Nedelsky
(1954) and Angoff (1971). These procedures have been modified, consolidated,
lengthened, and abbreviated for use in several states; they are described and
discussed below.

Nedelsky (1954). Nedelsky (1954) has outlined the procedure as it would
be used by instructors reviewing multiple-choice items to set a standard for a
classroom test.



Description of the Technique 




Letter grades F, D, C, B, and A used in this article have the
conventional meaning of failure (F), barely passing (D), etc.

The proposed technique for arriving at the "minimum passing"
score of an objective test, each item of which has a single correct
response, is as follows:

Directions to Instructors

Before the test is given, the instructors in the course are
given copies of the test and the following directions:

In each item of the test, cross out those responses which the
lowest D-student should be able to reject as incorrect. To the
left of the item write the reciprocal of the number of the
remaining responses. Thus if you cross out one out of five
responses, write 1/4.

Example. (The example should preferably be one of the items of
the test in question.)

Light has wave characteristics. Which of the following is the
best experimental evidence for this statement?

     A. Light can be reflected by a mirror.
     B. Light forms dark and light bands on passing through a
        small opening.
     C. A beam of white light can be broken into its component
        colors by a prism.
1/4  D. Light carries energy.
     E. Light operates a photoelectric cell.

Preliminary Agreement on Standards

After the instructors have marked some five or six items
following the directions above, it is recommended that they hold a
brief conference to compare and discuss the standards they have
used. It may also well be that at this time they agree on a
tentative value of constant k (see section on The Minimum Passing
Score). After such a conference, the instructors should proceed
independently.

Terminology

In describing the method of computing the score corresponding
to the lowest D, the following terminology is convenient:

a. Responses which the lowest D-student should be able to
   reject as incorrect, and which therefore should be
   primarily attractive to F-students, are called F-responses.
   In the example above, response E was the only F-response in
   the opinion of the instructor who marked the item.

b. Students who possess just enough knowledge to reject
   F-responses and must choose among the remaining responses
   at random are called F-D students, to suggest borderline
   knowledge between F and D.

c. The most probable mean score of the F-D students on a test
   is called the F-D guess score and is denoted by $M_{FD}$. As
   will be shown later, $M_{FD}$ is equal to the sum of the
   reciprocals of the numbers of responses other than
   F-responses. (In the example above, the reciprocal is 1/4.)

d. The most probable value of the standard deviation
   corresponding to $M_{FD}$ is denoted by $\sigma_{FD}$.
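
The F-D guess score defined in item c can be illustrated with a short sketch. It assumes one instructor's markings are stored as the number of responses left on each item after the F-responses are crossed out; the function name `fd_guess_score` and the five-item markings are hypothetical and do not come from Nedelsky.

```python
from fractions import Fraction


def fd_guess_score(non_f_counts):
    """Return the F-D guess score for one instructor.

    non_f_counts holds, for each item, the number of responses that
    remain after the F-responses have been crossed out. The score is
    the sum of the reciprocals of those counts.
    """
    return sum(Fraction(1, s) for s in non_f_counts)


# Hypothetical markings for a five-item test (illustration only).
remaining = [4, 4, 2, 3, 5]
m_fd = fd_guess_score(remaining)
print(m_fd, float(m_fd))
```

Exact fractions are used so that the reciprocals written beside the items (1/4, 1/2, and so on) add up without rounding error.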

It should be clear that "F-D students" is a statistical
abstraction. The student who can reject the F-responses for every
item of a test and yet will choose at random among the rest of the
responses probably does not exist; rather, scores equal to $M_{FD}$
will be obtained by students whose patterns of responses vary widely.

The Minimum Passing Score

The score corresponding to the lowest D is set equal to
$\bar{M}_{FD} + k\sigma_{FD}$, where $\bar{M}_{FD}$ is the mean of the
$M_{FD}$ obtained by various instructors, and $k$ is a constant whose
value is determined by several considerations. The F-D students are
characterized not so much by the positive knowledge they possess as by
being able to avoid certain misjudgments. Most instructors who have used
the F-D guess score technique have felt that this "absence of ignorance"
standard is a mild one, and that therefore the minimum passing score
should be such as to fail the majority of F-D students. Assigning
to $k$ values -1, 0, 1, and 2 will (on the average) fail respectively
16 percent, 50 percent, 84 percent, and 98 percent of the F-D
students. An informed final decision on the value of $k$ can be
reached after the instructors have chosen the F-responses, for at
that time they are in a better position to estimate the rigor of the
standards they have been using. In keeping within the spirit of
absolute standards, however, the value of $k$ should be agreed on
before the values of $M_{FD}$ are computed and certainly before the
students' scores are shown.

It is the essence of the proposed technique that the standard
of achievement is arrived at by a detailed consideration of
individual items of the test. Only minor adjustments should be effected
by varying the value of $k$. The reason for introducing constant $k$,
with the attendant flexibility and ambiguity, is that F-responses
in most examinations vary between two extremes: the very wrong, the
choice of which indicates gross ignorance, and the moderately wrong,
the rejection of which indicates passing knowledge. If a particular
test has predominantly the first kind of F-responses, this
peculiarity of the test can be corrected for by giving $k$ a high
value. Similarly, a low value of $k$ will correct for the
predominance of the second kind of F-responses. It is expected that in the
majority of cases a change of not more than ±.5 in the tentative
value of $k$ agreed upon during the preliminary conference should
introduce the necessary correction. It would be difficult to find a
theoretical justification for values of $k$ as high as two; for most
tests the value of $k = 0$ is probably too low. This suggests a
rather narrow working range of values, say between .5 and 1.5, with
the value $k = 1$ as a good starting point.

If a part A of a given test consists of $N_A$ items, each of
which has $s_A$ non F-responses (one of these being the right
response), the F-D guess score for each item, i.e., the probability
that an F-D student will get the right answer in any one item, is
$p_A = 1/s_A$. The most probable values of the mean and the square
of the standard deviation on this part of the test are given by

$$M_A = p_A N_A \quad\text{and}\quad \sigma_A^2 = p_A (1 - p_A) N_A.$$

$M_{FD}$ and $\sigma_{FD}$ for the whole test are given by
$M_{FD} = \sum M_A$ and $\sigma_{FD}^2 = \sum \sigma_A^2$. The value of
$M_{FD}$ must be accurately computed for each test. $\sigma_{FD}$,
however, may be given an approximate value. In a test of five-response
items $s$ may vary from one to five. If these five values are equally
frequent, $\sigma_{FD} = .41\sqrt{N}$. If, on the other hand, the extreme
values, $s = 1$ and $s = 5$, are less frequent than the other three
values, as seems likely to be true for most tests,
$.41\sqrt{N} < \sigma_{FD} < .50\sqrt{N}$. Since $k\sigma_{FD}$ is usually
much smaller than $M_{FD}$, approximations are in order. With $k = 1$
and $\sigma_{FD} = .45\sqrt{N}$, the equation

$$\text{Minimum Passing Score} = \bar{M}_{FD} + .45\sqrt{N}$$

should work out fairly well in the majority of cases and
is therefore recommended as a starting point in experimenting with
the proposed technique.
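
As a rough illustration of the formulas in this passage, the sketch below computes the F-D mean and standard deviation item by item and then applies the minimum passing score rule. The function names and the sample values are hypothetical; the per-item summation is equivalent to grouping items into parts with equal numbers of non F-responses.

```python
import math


def fd_mean_and_sd(non_f_counts):
    """Return (M_FD, sigma_FD) for one instructor's markings.

    Each item contributes p = 1/s to the mean and p * (1 - p) to the
    variance, where s is the number of non F-responses on that item.
    """
    mean = 0.0
    variance = 0.0
    for s in non_f_counts:
        p = 1.0 / s
        mean += p
        variance += p * (1.0 - p)
    return mean, math.sqrt(variance)


def minimum_passing_score(mean_m_fd, sigma_fd, k=1.0):
    """Mean of the instructors' M_FD plus k standard deviations."""
    return mean_m_fd + k * sigma_fd


def approximate_sigma(n_items, coefficient=0.45):
    """Shortcut suggested in the text: sigma_FD is about .45 * sqrt(N)."""
    return coefficient * math.sqrt(n_items)


# Five items whose non F-response counts run from 1 to 5 (equally frequent).
mean, sd = fd_mean_and_sd([1, 2, 3, 4, 5])
print(round(mean, 3), round(sd / math.sqrt(5), 3))
```

With the five values of s equally frequent, the ratio of the standard deviation to the square root of the number of items comes out close to the .41 quoted in the text.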

Refinements of the Technique

The definition of the F-response given above has an element of
ambiguity. The lowest D-student may be expected to reject a given
response on its own merits as clearly incorrect or because it is
clearly less correct than some of the other responses. In the
example given under "Directions to Instructors" response E cites
evidence against the wave theory of light and thus is an F-response
on its own merits; other responses are consistent with the theory
and may be considered non F-responses. It may be argued, however,
that even a D-student should see that response D constitutes less
cogent evidence than some of the other responses, and that therefore
it is an F-response. Judging a response in comparison with other
responses is theoretically sound, for it probably more closely
corresponds to the mental processes of the student. To make a
proper judgment of this kind requires time and considerable
pedagogical and test-wise sophistication; with responses more
heterogeneous than in the example cited a reliable judgment may be
impossible. Experimentation with both definitions of the F-response
is certainly in order, but at least in the beginning, the simpler
version, i.e., judging each response on its own merit, is to be
preferred.

Some instructors find it difficult in a good number of cases to
decide whether a response is an F-response. There is no theoretical
reason against assigning to such a response half the statistical
value of an F-response. (If, in the example cited, response D had
been assigned the value of 1/2, the item would have had 1.5
F-responses and 3.5 non F-responses. Consequently the value of $p$
for the item would have been 1/3.5 rather than 1/4.) If methodically
and conscientiously pursued, such a procedure may result in a better
agreement among the instructors. It is not recommended as a
substitute for clear and hard thinking about the degree of correctness
of a response.

In theory, the proposed technique can be extended to assigning
minimum scores corresponding to grades C, B, and A. The author has
few data bearing on such an extension; they indicate fairly clearly,
however, that a very thorough discussion of the meaning of the grades
of C, B, and A among the participating instructors must precede
actual marking of the test. It seems fairly certain, moreover, that
even if the instructors reach a really circumstantial verbal
agreement on the meaning of these grades, modifications of the proposed
technique are likely to be necessary. For, though an "absence of
ignorance" standard may be adequate for identifying the barely
passing students, more positive indications of achievement
corresponding to higher grades seem desirable.

Perhaps a reasonable D-C guess score can be obtained by
requiring the lowest C-students to reject responses that are in
certain respects, or to a certain degree, inferior to other
responses; the kind and the degree of inferiority must, of course,
correspond to the instructors' definition of the meaning of the grade
of C. To establish minimum scores corresponding to grades B or A,
an instructor should probably focus his attention on the correct
response and inspect the wrong responses primarily for their degree
of deviation from the correct response; the allowable deviations for
the lowest B or A will depend on the meanings assigned to these
grades.

As the preceding paragraph suggests, the criteria used for 
determining the minimum scores corresponding to lowest D, C, and B 
or A may be qualitatively different; the method for computing these 
scores may be the same for all grades, e.g., lowest C score = 

Directions to Instructors

a. In each item of the test, cross out, using a single pencil
line, those responses which the lowest D-student should be able
to reject as incorrect. To the left of the item, against the
D-response, write the reciprocal of the number of the remaining
responses. (Thus, if you cross out one out of five responses,
write 1/4.)

b. Of the remaining responses cross out, using a double line,
those which the lowest C-student should be able to reject.
Write the reciprocal of the number of responses that still
remain to the left of the C-response. (Thus, if you had
already crossed out one out of five possible responses, and now
cross out two more, write 1/2.)

c. Repeat the procedure for the lowest B-student, using a triple
line.

d. Repeat the procedure for the lowest A-student, using a cross.

Example: Light has wave characteristics. Which of the
following is the best experimental evidence for this statement?

 1    A. Light can be reflected by a mirror.
 1    B. Light forms dark and light bands on passing through a
         small opening.
 1/2  C. A beam of white light can be broken into its component
         colors by a prism.
 1/4  D. Light carries energy.
      E. Light operates a photoelectric cell.

In the opinion of the instructor who marked the example above,
response E should be rejected by the lowest D-student, responses A
and D by the lowest C-student, and response C by the lowest
B-student. Since the letters of the responses happen to correspond
to the usual letter grades, it is convenient to record the
reciprocal of the number of responses among which the lowest D-student is
to choose against the D-response, etc. In the example above, the
lowest B-student is expected to reject all but the correct response;
the lowest A-student is of course expected to do just as well; hence
number 1 is placed against both response B and response A.
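
The multi-grade marking in this example can be traced with a short sketch. It assumes each response is tagged with the lowest grade expected to reject it (None for the correct response); the dictionary layout and the function name `guess_probabilities` are hypothetical conveniences, not part of Nedelsky's description.

```python
from fractions import Fraction

# Grades in ascending order; each level also rejects what lower levels reject.
GRADES = ["D", "C", "B", "A"]


def guess_probabilities(rejected_by):
    """Map each grade to the reciprocal of the responses left to choose among.

    rejected_by maps a response label to the lowest grade whose
    borderline student should reject it, or None for the keyed response.
    """
    remaining = len(rejected_by)
    probabilities = {}
    for grade in GRADES:
        remaining -= sum(1 for g in rejected_by.values() if g == grade)
        probabilities[grade] = Fraction(1, remaining)
    return probabilities


# Markings from the worked example: E rejected at D, A and D at C, C at B.
example = {"A": "C", "B": None, "C": "B", "D": "C", "E": "D"}
print(guess_probabilities(example))
```

For the item shown, this reproduces the recorded values: 1/4 for the lowest D-student, 1/2 for the lowest C-student, and 1 for both the lowest B-student and the lowest A-student.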

It is possible to construct a test in such a way as to make the
determination of the scores corresponding to lowest D, C, B, and A
easier and more reliable. In such a test some responses would be
designed to be attractive only to F-students, others to F-students
and D-students, etc. By including predetermined numbers of such
responses the test maker can prepare a test having any desired value
for the minimum score corresponding to any letter grade. Whether or
not absolute standards are to be used, a test of this kind is likely
to have the advantage of being discriminating in the whole range
from F to A.

(Nedelsky, 1954, pp. 4-10)



Descriptions of the Nedelsky procedure outlined by Glass (1978) and Zieky &
Livingston (1977) adapt the original Nedelsky procedure for easier
implementation. The Zieky & Livingston description includes a simplified case for
only the minimum competence level, while the Glass description includes the
consideration of groups of students at different competence levels.

Angoff (1971). In the Angoff (1971) method, expert judges review a test
item in its entirety and state the probability that a person with minimum
competency can give the correct response. The Angoff procedure is easy to
explain, easy to understand, and easy to administer. It is less time
consuming than Nedelsky's (1954) and can be used on open-ended items.
In this procedure:

    ... ask each judge to state the probability that the 'minimally
    acceptable person' would answer each item correctly. In effect, the
    judges would think of a number of minimally acceptable persons, instead
    of only one such person, and would estimate the proportion of minimally
    acceptable persons who would answer each item correctly. The sum of
    these probabilities, or proportions, would then represent the minimally
    acceptable score (Angoff, 1971, p. 515).
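
A minimal sketch of the Angoff computation follows. The per-judge sum comes directly from the quotation; averaging the judges' sums into a single panel value is an assumption made here for illustration, as are the function names and the sample ratings.

```python
from statistics import mean


def judge_cut_score(item_probabilities):
    """Sum one judge's item probabilities (the minimally acceptable score)."""
    return sum(item_probabilities)


def panel_cut_score(ratings_by_judge):
    """Average the judges' sums. Averaging is an assumption of this sketch."""
    return mean(judge_cut_score(r) for r in ratings_by_judge)


# Hypothetical ratings: two judges, three items each.
ratings = [
    [0.50, 0.75, 1.00],
    [0.25, 0.75, 0.50],
]
print(panel_cut_score(ratings))
```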

Jaeger (1978). In addition, a method proposed by Jaeger (1978), and used
for standard setting for student assessment, deserves mention. This procedure
maximizes the involvement of educational constituencies. In the North Carolina
application, 700 persons convened in groups of 50 to proceed through the
standard-setting model. The procedure is as follows:

Judges were first required to take the exam they would later rate. For
each item, judges were asked one of the following questions:

1. Should every high school graduate be able to answer this item
   correctly?

2. If a student does not answer this item, should s/he be denied a high
   school diploma?

Judges next received the results of the above survey questions as well as
actual performance data. With this information, judges were asked to review
and revise their initial judgments as they considered necessary.

The procedure then calls for recalculation of the judges' ratings,
redistribution of the new ratings, and another judgment. Judges then received
information on the proportion of students who would have passed or failed, as
determined on the basis of the recommended cutoff scores.
With this information, judges were asked to make a final statement on the
"necessity" for each item on the test.

Median scores were calculated by group (type or constituency), and the
passing score was then set at the minimum median score calculated for a group.

This process is technically straightforward and involves iterative
reviews and the inclusion of normative student data.
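
The final step of the Jaeger procedure (median by group, then the minimum median) can be sketched as below. The sketch assumes each judge's recommended score is the number of items that judge finally rated as necessary; that reading, the group names, and the votes are illustrative assumptions rather than details given in the text.

```python
from statistics import median


def judge_standard(necessity_votes):
    """Count the items one judge rated as necessary (assumed scoring rule)."""
    return sum(1 for vote in necessity_votes if vote)


def jaeger_passing_score(votes_by_group):
    """Return (passing score, medians by group).

    The passing score is the smallest of the group medians.
    """
    medians = {
        group: median(judge_standard(votes) for votes in judges)
        for group, judges in votes_by_group.items()
    }
    return min(medians.values()), medians


# Hypothetical final-round votes on a four-item test.
votes = {
    "teachers": [
        [True, True, False, True],
        [True, True, True, True],
        [True, False, False, True],
    ],
    "parents": [
        [True, False, False, False],
        [True, True, False, False],
        [True, True, True, False],
    ],
}
passing, by_group = jaeger_passing_score(votes)
print(passing, by_group)
```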

Procedures in Use

Georgia, Alabama. In 1977, Nassif (1978) employed a procedure which
began as a modification of Nedelsky. The desire was to simplify the Nedelsky
procedure on two dimensions. Each item was to be reviewed in its entirety,
rather than reviewing each component (i.e., each distractor), and one level of
competence (minimum acceptable) was considered rather than several. The
resulting procedure conceptually matches Angoff. The procedure operationally
defined is as follows:

    Panels of expert judges reviewed items independently on an
    item-by-item basis. The following was asked about each valid item:
    "Should a person with minimum competency in the teaching field be able to
    answer this item correctly?" Each judge was asked to imagine the skills
    of a hypothetical candidate with minimum competency in the content of a
    teaching field. Within this frame of reference the item was examined as
    to whether it required too sophisticated a knowledge of the content or
    whether it required content knowledge of trivial or minor importance.

    Judges responded "yes" if the item was considered appropriate for
    measuring minimum competency or "no" if otherwise. The "I don't know"
    option was available for judges unfamiliar with the content of an item.

    The significance of agreement was determined by comparing the number
    of "yes" responses with probability tables for the binomial distribution.
    The ratings of "I don't know" were not considered for any item, so that
    dichotomous ratings with different numbers of judges were generated. If
    the probability of receiving a given number of "yes" ratings (i.e.,
    appropriate for minimum competency) was less than a chance of 1 in 10,
    the item was classified as an appropriate requirement for minimum
    competency (Nassif, 1978).

This procedure has been used both in the Georgia and Alabama teacher
certification programs.
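
The binomial check described in the quoted procedure might be sketched as follows. The quotation does not state the chance probability used with the binomial tables, so the value 0.5 below is an assumption, as are the function names and the rating labels.

```python
from math import comb


def upper_tail_probability(n_judges, n_yes, p_chance=0.5):
    """Probability of n_yes or more "yes" ratings out of n_judges by chance."""
    return sum(
        comb(n_judges, k) * p_chance**k * (1 - p_chance) ** (n_judges - k)
        for k in range(n_yes, n_judges + 1)
    )


def is_appropriate(ratings, alpha=0.10, p_chance=0.5):
    """Classify an item after dropping "I don't know" ratings.

    alpha = 0.10 mirrors the "1 in 10" criterion; p_chance is assumed.
    """
    votes = [r for r in ratings if r in ("yes", "no")]
    n_yes = votes.count("yes")
    return upper_tail_probability(len(votes), n_yes, p_chance) < alpha


# Hypothetical panel: 8 yes, 2 no, 1 "I don't know".
panel = ["yes"] * 8 + ["no"] * 2 + ["dk"]
print(is_appropriate(panel))
```

Because "I don't know" ratings are dropped first, the number of judges entering the test can differ from item to item, as the quoted text notes.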

South Carolina, Oklahoma. As mentioned earlier, South Carolina conducted
a post hoc validation of the NTE in 1977. In the standard-setting portion of
this procedure, a modification to the Angoff procedure was used in which judges
selected the probability that minimally competent candidates should be able
to answer an item correctly from a seven-point scale, rather than providing
the probability. While this restricts a judge's choice of response, it eases
data reduction and analysis.

In a subsequent teacher certification effort, South Carolina embarked on
developing ten content area tests and a basic skills education entrance test.
The Angoff procedure as described earlier was used for the content tests. The
Jaeger approach was employed for the Basic Skills test.

In the Oklahoma Teacher Certification Program, the Angoff approach was
used to determine the standards for the tests.

Florida. The Florida Teacher Certification Exam program involves
assessing candidates on competencies in four areas: Math, Reading, Writing,
and Professional Education. Each of these areas forms a separate subtest
which the candidate must pass. There is, therefore, a separate cut score
established for each section. The Writing section is scored holistically and
the standard passing score is set by State Board review of performance data
and the level of competence described by the score points on the possible
performance range.

The cut scores on the three multiple-choice sections are set separately
by an Advisory Committee and approved by the State Commissioner. The
procedures used to set the cut score involve a review of performance data generated
by a field test and an examination of sample items and their associated Rasch
calibrations to determine which items represent the cut score.



Advantages

Why are the Nedelsky, Angoff, and Jaeger approaches used predominantly
when the list of methods used to set standards for other competency testing
programs contains several other standard-setting models? (For example,
other methods include: Contrasting Groups and Borderline Groups; Ebel;
Administrative Decision (see Nassif, 1979 for a discussion of these models).)
Several reasons follow:

• These procedures are based on and permit an item-by-item review. This
  is a very important consideration for tests that are regenerated in
  part quite frequently due to test security and job analysis
  requirements.

• The procedures permit the incorporation of performance data in
  judgment if desired as additional information in the decision-making
  process.

• These procedures allow the establishment of single or multiple cut
  scores as necessitated by the testing program. In the case of
  multiple cut scores, compensatory or disjunctive scoring can take
  place.

• These models are easy to understand, a factor which should contribute
  to the reliability of judges' ratings and to the comprehensibility by
  constituent audiences.

• These involve and rely on expert judges.

• The cut score that is set does bear a relationship to necessary job
  performance, a legal requirement. It allows all competent candidates
  to pass, without restriction from quotas.

• They do not require information (statistical or demographic) not
  generally available.

• These methods produce a cut score which can be adjusted easily by
  standard error of measurement to incorporate relevant employment
  factors.

• These methods can be employed on any number of items, although the
  original Nedelsky and Jaeger approaches are prohibitive due to the
  length of the process.




Until recently, few studies had been done comparing the results of using
different cut score models. In 1976, Andrew & Hecht found that different cut
scores resulted using the Nedelsky and the Ebel procedures; Skakun & Kling
(1980) reviewed modified Ebel and Nedelsky procedures, along with their
currently used, normative approach. While the magnitude of the differences in
yielded cut scores varies across comparisons, they found that "results
indicate that different approaches for establishing a passing score on an
examination produce different standards" (Skakun & Kling, 1980, p. 233).
Brennan & Lockwood (1979) found different cut scores produced by Nedelsky and
Angoff procedures.

Equating*

Teacher certification testing programs generally provide the candidate
with multiple opportunities to retake the exam he/she has failed. If the
same questions are used repeatedly, the examiner will not know if the
candidate's knowledge of the subject matter is being assessed or his/her memory.
Another issue that has arisen is that public scrutiny of certification exams
may require dissemination of the test even after only a single administration.

* The author wishes to acknowledge the contributions to this section of the
paper by Dr. StevejuUng-f Gunn.

    One response to these issues, and perhaps the most prevalent, has
    been an increased emphasis on development of parallel forms of tests.
    This response is understandable for three reasons. First, the
    availability of parallel forms reduces the problem of test security
    between administrations. Second, it answers political pressure to
    release the test after administration for use in diagnosis of candidates'
    weaknesses and tailoring of remedial services. Third, it ensures that an
    individual student may be retested on the same skills with different test
    items, minimizing the effects on performance of the prior administration.

    The increasing need for alternate forms of tests in programs across
    the country has redoubled interest among researchers, educators, and
    policy-makers in how best to ensure that the score or pass-fail decision
    for a given student not depend on "which form" the student took or "when"
    the student participated. The statistical problem of test form
    equivalence takes two primary forms. The first is maximizing the likelihood
    that a student would receive the same score on two different forms of a
    test. The second is a more simplified task of minimizing errors of
    classification, that is, maximizing the likelihood that a student will
    receive the same classification (pass or fail), although not necessarily
    the same score, on two alternate forms. The former is most appropriate
    when the purpose of testing and the prescribed use of test results is to
    analyze a student's level of functioning [illegible] from administration
    to administration. The latter is most typical of minimum competency
    testing programs that are directed primarily at determining a student's
    status simply with respect to a cut score (Nassif, Pinsky, & Rubinstein,
    1980).

In practical terms, there are two approaches to accomplishing statistical
equivalence of alternate forms. One is to "equate tests" by selecting items
with equivalent psychometric characteristics; for example, the p-value method
(Nassif, Rubinstein, & Pinsky, 1979) or fit to the Rasch model (Wright, 1977).
The other is to "equate scores" by paying relatively less attention to the
psychometric characteristics of individual items (except in the normal course
of screenings for psychometric adequacy) and solving the statistical problem by
scaling the test (or subtest) scores produced; for example, the linear and
equipercentile methods (Angoff, 1971).

This section is not meant to be a comprehensive technical analysis of
equating methods, factors, and consequences. Angoff (1971), Jaeger (1980),
Wright (1977), and Kolen (1981), to name a few, have presented research on
various aspects of this topic. The purpose of this discussion is to make the
practitioner aware of some aspects of this complex process that so directly
affects the area of teacher certification testing. In addition, the reader
should know that numerous avenues for guidance or assistance exist for solving
these technical issues.

The methods one can use for equating are numerous, of course. As in the
standard-setting section of this paper, the methods frequently used for
teacher certification testing will be described with an indication of which
states are adopting which approach.

Following are brief citations of the linear equating technique, the
p-value item substitution method, and the Rasch model.

Linear Equating (Angoff, 1971). The linear equating model is stated
simply as follows. Raw scores are converted to scale scores so that the
emphasis is on correct score conversion. Scores are calibrated to adjust for
variations in test difficulty and dispersion by using a set of items common to
both forms of the test. The purpose of this common section is to establish a
statistical link between the two test forms. Through this link, scores on the
second form can be calibrated to the scale of the first form. (The approach
under consideration here is the one that utilizes a separate test to each
group with a common test to both groups (cf. Design IV, Angoff, 1971).)

Given two different groups, assumed to be random samples from the same
population, as in the case of candidates being tested at various
administrations, taking tests x and y with a common anchor (u) given to both
groups (e.g., Group A takes x and u; Group B takes y and u), statistical
assumptions are applied to estimate the mean and standard deviation,
$\hat{M}$ and $\hat{s}$, for each test, if it were given to the
total group (T = A + B).

The goal is to transform raw scores on y (the new form) to the scale of x (the
original form). Then, given the estimated parameters, the conversion equation
is defined as:

$$X = aY + b,$$

where

$$a = \frac{\hat{s}_{x_t}}{\hat{s}_{y_t}} \quad\text{and}\quad b = \hat{M}_{x_t} - a\,\hat{M}_{y_t}.$$

The Tucker model of linear equating is used when the two groups do not
differ widely in ability as measured by performance on u. The Levine method
is used when the two groups do differ.
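
The linear conversion of new-form scores to the original scale can be sketched as below. The sketch takes the estimated total-group means and standard deviations as given inputs; how those estimates are produced (Tucker or Levine) is not shown, and the function names and numbers are hypothetical.

```python
def linear_conversion(mean_x, sd_x, mean_y, sd_y):
    """Return slope a and intercept b that place form y on the scale of form x.

    The inputs are the estimated total-group mean and standard deviation
    of each form; this sketch does not derive them from anchor data.
    """
    a = sd_x / sd_y
    b = mean_x - a * mean_y
    return a, b


def equate(y_score, a, b):
    """Convert a raw score on the new form to the original form's scale."""
    return a * y_score + b


# Hypothetical estimates: original form x and new form y.
a, b = linear_conversion(mean_x=50.0, sd_x=10.0, mean_y=48.0, sd_y=8.0)
print(a, b, equate(56.0, a, b))
```

A score at the mean of the new form maps to the mean of the original form, and a score one standard deviation above the mean maps to one standard deviation above, which is the defining property of a linear conversion.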

Forms of the National Teachers Examination, administered several times a
year, are equated (Angoff, 1971) by linear or equipercentile methods.

The Alabama and Oklahoma State Teacher Certification Testing Programs are
designed to use the Tucker linear equating method. The anchor tests in these
programs are the subset of items which are repeated across two successive
administrations, i.e., common to both administrations. The anchor tests
contain the same distribution of items by content area as each total test and
comprise 70-80% of each total test. New items replace previously used items
matched on content and difficulty. Upon analyzing the test data, items with
the best statistical properties representing the strongest content coverage
constitute the scorable items on the new main form. The scorable items are
equated to the same number of scorable items on the form previously
administered. The data are reported on a converted scale which allows the
same reported cut score across different test forms or fields and multiple
administrations.

The advantage of using linear equating is that equivalence of scoring is
ensured. However, a disadvantage in the teacher certification environment
occurs in teaching fields with too low an incidence of candidate examinees for
equating. In low incidence fields, the p-value approach, described later, may
be more appropriate. An advantage to the linear equating method is that items
need not be field tested prior to the administration in which they are used as
scorable items.

It should be noted that linear equating is only appropriate when the
relationship between the raw scores and the transformed scores is, in fact,
linear. Where significant deviations from linearity are observed,
equipercentile methods should be used (Jaeger, 1980).

P-value/Point-Biserial Test Equating. This straightforward plan requires
the construction of tests with equating, replaceable, and experimental
sections. Each of these sections is a mini-test, in that it forms a
stratified sample of items from the entire test domain. (The pass/fail
decision is based on the scorable replaceable items; that is, the experimental
items do not contribute to the examinee's score.) The substitution of
experimental items into scorable replaceable items is done within an objective
for items of the same difficulty and comparable point-biserial (item-total
test discrimination), as determined from the previous testing session. The
item substitution plan preserves the content validity of the test as well as
the statistical difficulty of the test. Where items cannot be matched exactly
on p-value within an objective, one averages the differences over clusters of
objectives within the same content subarea (Nassif, Pinsky, & Rubinstein,
1979).

This method has been used successfully in the Georgia Teacher 
Certification Program. 
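
One way to picture the item substitution step is the sketch below, which looks for an experimental item in the same objective with a close p-value and point-biserial. The tolerance values, the dictionary fields, and the function name are assumptions made for illustration; the source does not specify matching tolerances, and the averaging over clusters of objectives is not shown.

```python
def pick_replacement(target, candidates, max_p_diff=0.05, max_rpb_diff=0.10):
    """Pick the experimental item that best matches a replaceable item.

    Items are dicts with "objective", "p" (p-value), and "rpb"
    (point-biserial). The tolerances are illustrative assumptions.
    Returns None when no candidate in the same objective is close enough.
    """
    pool = [
        c for c in candidates
        if c["objective"] == target["objective"]
        and abs(c["p"] - target["p"]) <= max_p_diff
        and abs(c["rpb"] - target["rpb"]) <= max_rpb_diff
    ]
    if not pool:
        return None
    # Prefer the closest p-value, then the closest point-biserial.
    return min(
        pool,
        key=lambda c: (abs(c["p"] - target["p"]), abs(c["rpb"] - target["rpb"])),
    )


# Hypothetical item statistics from a previous administration.
target = {"id": "R1", "objective": "OBJ-3", "p": 0.72, "rpb": 0.35}
candidates = [
    {"id": "X1", "objective": "OBJ-3", "p": 0.70, "rpb": 0.33},
    {"id": "X2", "objective": "OBJ-3", "p": 0.55, "rpb": 0.36},
    {"id": "X3", "objective": "OBJ-4", "p": 0.72, "rpb": 0.35},
]
print(pick_replacement(target, candidates)["id"])
```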

Rasch Model. According to Wright (1968), the Rasch model calibrates test
items independent of the ability level of the examinee sample used for
calibration purposes. Further, the measurement of examinees occurs
independent of the difficulty of the test used for measurement purposes.

Since sample-free estimates of item difficulty with respect to a common
score are obtained for all items, item banking is easily achieved. Parallel
forms of tests are then created and equivalence of scoring is ensured by
creation of test forms of known difficulty and dispersion.
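
For readers unfamiliar with the model, the sketch below uses the standard one-parameter (Rasch) response function, in which the probability of a correct answer depends only on the difference between examinee ability and item difficulty in logits. The formula is the usual statement of the model rather than something given in this paper, and the function names and difficulty values are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (logit units)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_score(ability, difficulties):
    """Expected raw score on a form made of items with known difficulties."""
    return sum(rasch_probability(ability, b) for b in difficulties)


# Two hypothetical forms drawn from a calibrated bank with matching difficulties.
form_1 = [-1.0, 0.0, 1.0]
form_2 = [-1.0, 0.0, 1.0]
print(expected_score(0.0, form_1), expected_score(0.0, form_2))
```

Forms assembled from banked items with the same difficulty values yield the same expected score at every ability level, which is the sense in which forms of known difficulty support equivalent scoring.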

The Florida and parts of the South Carolina Teacher Certification Testing 
Programs rely on the Rasoh model for creating equated tests used in successive 
administrations. Items from previJus administrations are' seeded onto 
subsequent test forms to observe shifts. Those .seeded items also provide a 
link back to the item bank. 
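Under the Rasch model, the probability of a correct response depends only on the difference between examinee ability and item difficulty, both expressed on the same logit scale. The sketch below shows that relationship and how a form of known item difficulties yields an expected raw score. The function names are illustrative and do not come from any of the state programs mentioned.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model.
    Both arguments are in logits on a common scale."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_raw_score(ability, difficulties):
    """Expected number-correct score on a form whose calibrated item
    difficulties are known (e.g., items drawn from a bank)."""
    return sum(rasch_probability(ability, b) for b in difficulties)
```

Two forms assembled from banked items can be compared by computing `expected_raw_score` for the same ability value on each form. Similar expected scores across the ability range indicate forms of similar difficulty.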
Why Are These Methods Used?

• Linear equating is a straightforward procedure which accommodates
  varying amounts of item overlap from one administration to the next.
  Generally, it is advised that at least 25% of the test be anchored
  from one administration to the next.

• Different linear equating methods available accommodate varying
  statistical assumptions or effects (e.g., Tucker & Levine).

• Linear equating does not require a separate field test of the new
  replacement items, assuming sufficient sample size.

• All models allow for, but may not require, content mapping and
  difficulty and discrimination match of the replacement with the
  replaceable items.

• The p-value approach for creating new forms accommodates teaching
  fields with low incidence of applicants in that data are pooled over
  several administrations until an adequate data base has accumulated.

• The Rasch model targets tests to examinees' ability level, so that
  greater efficiency in testing is believed to be achieved.
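The basic idea behind the linear equating mentioned in the list above is that a score on the new form is mapped to the old form's scale by matching standardized scores. The sketch below uses plain means and standard deviations and therefore assumes equivalent examinee groups. The Tucker and Levine methods differ in how they estimate these moments from anchor items, and that estimation is not shown here. The function name is illustrative.

```python
from statistics import mean, pstdev


def linear_equate(x, scores_new, scores_old):
    """Map a raw score x on the new form to the old form's scale by
    equating z-scores: (x - mean_new) / sd_new == (y - mean_old) / sd_old.
    Assumes both score lists come from equivalent groups of examinees."""
    m_new, m_old = mean(scores_new), mean(scores_old)
    sd_new, sd_old = pstdev(scores_new), pstdev(scores_old)
    return m_old + (sd_old / sd_new) * (x - m_new)
```

If the new form is uniformly five points harder, for instance, each new-form score is moved up by about five points on the old form's scale.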

When Do Test Forms Need to be Changed?

If there is reason to believe that there has been a security break on the
test, a new test form should be developed and administered. After a test form
has been administered several times and there is reason to believe that
performance on the test can be significantly affected by multiple retakes
of the same exam, a new exam should be developed. Clearly, if the content
domain or job definition changes, corresponding changes should be reflected on
the test.

In this time of restricted resources, no test administrator wants to
develop more tests than are necessary. In the teaching fields with few
examinees, multiple administrations of the same exam can be justified for the
purpose of collecting test statistical data. In larger fields, the test form
may be changed after it has been administered to an adequate number of
examinees (say, 250). This generally occurs at least once a year in these
larger fields.

Summary

Technical aspects of teacher certification program design need careful
attention. Several states, notably Georgia, Florida, South Carolina,
Oklahoma, and Alabama, have begun addressing these issues of validity, job
analysis, standard setting, and equating, and have embarked on various
developmental efforts. Other states are in the process of examining these
very issues. Their solutions will be viewed with much interest. Many
resources are available to the administrators, policy makers, and decision
makers thrust into addressing matters of legal and technical composition and
consequence. The field is replete with the need for further developmental
efforts on these issues and with the corresponding talent and interest to
satisfy those needs.
Job Analysis

Any instrument designed for certification or licensing, as is the
case in teacher certification testing, must be shown to be job related.
It must fairly measure the content knowledge relevant to the job as
performed by present job incumbents. Determining the job relatedness of
content selected for inclusion in certification tests is both endorsed in
the APA Principles for the Validation and Use of Personnel Selection
Procedures (1980) and required by the Equal Employment Opportunity
Commission Guidelines (1978). The guidelines require that the criteria
used as a basis of certification must bear an empirical and logical
relationship to successful job performance. For purposes of teacher
certification, this suggests that test content should reflect the content
knowledge or pedagogical skills required for teaching. While there are a
number of ways in which this domain of knowledge can be identified (cf.
Popham, 1980), a systematic job analysis is recommended to establish an
empirical and logical relationship to teacher performance.

Job Analysis Approaches 

Job analysis is a process of systematically collecting information
about the elements of a job. While job analysis has been routinely used
in personnel-related areas for close to a century, it is only within the
past few decades that it has been employed in personnel testing.

A variety of approaches to assessing the elements of a given work
situation are available; however, regardless of the selected method, most
approaches include some determination of the critical and frequently
performed elements of the job. Importance (criticality or essentiality) and
frequency of performance (time spent or percentage of time consumed on the
job) are the two key dimensions underlying most job analysis approaches.
Within the teacher certification arena, this would generally take the form
of assessing the important and frequently applied teaching skills or
content knowledge in the instructional setting.

Job analysis approaches can be seen to vary along a number of
dimensions. Levine, Ash, Hall, and Sistrunk (1981) have delineated three key
dimensions along which job analyses vary:

• type of descriptor or element used to describe the job,
• the source of job information, and
• data collection methodology.

Among the descriptors used to describe a job are tasks, activities,
skills, knowledge, and personal characteristics. A number of sources of
job information are potentially available; these include job incumbents,
supervisors, trained job analysts, and written documents. Data collection
methods include questionnaires, interviews, observation, diaries, and
actual job performance. Although it is clear that many approaches to
conducting job analyses are available, the application of job analysis
methodology to teacher certification testing has been somewhat limited.
Current applications of job analysis methodology to teacher
certification testing are presented below. Other job analysis approaches
derived from the three dimensions cited previously, with potential
applications to teacher certification, are offered following the discussion
of current applications.



Job Analysis Applications 



Job analysis has been used in the content validation of teacher
certification tests in a number of states. Among the states that have
conducted job analyses as part of their teacher certification test
development efforts are Georgia, Alabama, South Carolina, and Oklahoma. In
all four cases a survey approach was used. A sample of educators within the
state was sent a survey instrument requesting them to rate, on a Likert-type
scale, a series of content objectives developed by panels of content
experts. The objectives were rated in terms of the amount of time spent
teaching or using them and the extent to which they were essential to the
field. Based on the job analysis results, those objectives found to be most
job related were included in the content of the examinations. In some cases
an interview procedure was used with a sample of educators to supplement
the quantitative ratings and gather further information about job content.

Similar procedures were used in the development of the Florida
Teacher Certification Examination. Teacher competencies (objectives) were
developed by a panel of teacher educators. The competencies were then
sent to a sample of educators who rated the competencies in terms of their
perceived "importance" to the field. No ratings of "frequency of use" or
"time spent using" were collected.
Similar procedures have been used for more process-oriented
assessment measures developed for use in teacher certification. The Basic
Professional Studies Examination developed in Alabama to assess knowledge
of pedagogical skills relied on job analysis for determining the content
to appear on the test. A sample of educators across teaching fields rated
the frequency with which pedagogical skills were used and the importance
of those skills. The content of the Performance Observation Instrument
developed in South Carolina was defined through a job analysis procedure.
Again, using a survey approach, a sample of South Carolina educators rated
the importance and frequency of use (as well as observability and
relevance) of a series of teaching skills and behaviors.

The development of the current NTE did not involve job analysis;
however, the "Common" portion of the NTE is currently under revision, and
a form of job analysis is being used in defining the content to be
included on the revised examination. Here, state representatives have
been surveyed to determine the extent to which a proposed set of
pedagogically related topics is important for purposes of teacher
certification.

Job Analysis Alternatives



While job analyses conducted for current teacher certification tests
have almost exclusively been limited to survey questionnaires requesting
job incumbents to rate proposed test content in terms of importance and
frequency of use, other alternatives suitable for use in teacher
certification testing are available. Recommendations for job analysis
alternatives based on (1) type of data collection methodology and (2) source
of job information are provided below.

Teacher certification test development efforts, to date, have relied
on the collection of job information from a cross section of job
incumbents reflecting the teaching area for which the measure is being
developed. Alternatives to the use of a cross section of job incumbents
include the collection of job information from supervisors or solely from
superior performers on the job. Previous research comparing the job
information obtained from job incumbents and other observers is
conflicting (Levine et al., 1981). While the information obtained from
incumbents and other observers appears to be consistent in some job
settings, Levine et al. (1981) suggest that in other settings incumbents
tend to provide less accurate accounts of their job content. No specific
attempts have been made to investigate the information obtained from
teachers as compared to the information obtained from other observers in
the instructional environment, and the accuracy of teacher/educator-supplied
information remains to be explored. Future job analysis efforts
within the realm of teacher certification should consider obtaining
information from teacher supervisors (or outside observers) as well as from
teachers for purposes of comparison.

Similarly, little effort has been made to compare the job information
obtained from teachers judged as superior with that obtained from educators
judged to be poor performers. While Levine et al. (1981), in their recent
discussion of job analysis methodology, suggest that there are few
differences in the job information obtained from superior and less capable
performers in a variety of job settings, this remains to be verified in the
instructional setting. Future efforts to determine job-related content for
inclusion in teacher certification assessment instruments should include
the examination of the differences in information provided by educators
exhibiting different levels of performance (previously identified by school
personnel). However, the validity of teacher certification tests based on
job content defined solely by superior performers could be brought into
question, as these measures are generally designed as minimum competency
assessments.

Alternative data collection methods should be considered in job
analysis efforts undertaken for teacher certification test development
purposes. Among the alternatives to the survey questionnaire approach
(which has been the primary data collection method employed for teacher
certification testing to date) are (1) observation, (2) critical incident
technique (Flanagan, 1954), (3) document review, and (4) group discussion.

Job analysis data collection using observational methods relies on
trained observers observing the performance of job incumbents. Within the
realm of teacher certification, this would involve trained observers
observing the classroom behavior of teachers or other instructional
personnel to ascertain the content of the job. While this approach provides
a direct assessment of the job content, its feasibility is questionable
because of its obtrusiveness and the resources required. This is
particularly true in the case of content knowledge examinations developed
for teacher certification purposes; repeated observations over an extended
period of time would be required to provide an accurate assessment of the
content knowledge required on the job.

The critical incident approach, developed by Flanagan (1954),
involves the identification of job events that have resulted in either
inferior or superior performance (i.e., events that elicit behaviors
necessary for successful job performance). A large number of incidents
are collected from job incumbents (through diaries, interviews, etc.) and
are used to determine what behaviors are necessary to be effective on the
job. In application to teacher certification testing, this would require
the elicitation of critical incidents in the instructional setting from a
pool of instructional personnel. This approach is potentially useful for
the development of teacher performance measures or tests focusing on
pedagogical skills; however, the critical incident technique appears to have
little application to content knowledge-oriented measures. Levine et al.
(1981) report that this approach was not favored by experienced job
analysts for use in personnel selection.

The final two data collection approaches with potential application
to teacher certification are document review and group discussion.
Document review involves the use of available literature defining a job as a
basis for determining necessary job content. Here, job descriptions and
other documentation would be reviewed to determine the critical aspects of
the job to include in a personnel selection instrument. To the extent
that such documents exist within educational environments, this approach
could be employed. In fact, the review of such documents is already
carried out, to a limited extent, in the definition of content knowledge
or skills to be included on job analysis survey instruments used in
existing teacher certification test development projects. Similarly, the
fourth and final method to be considered, group discussion, has been used
in the development of existing teacher certification tests. In the
development of certification tests for Georgia, South Carolina, Alabama, and
Oklahoma, panels of experts were convened in the respective content areas
to generate content for inclusion on the job analysis survey instrument.
This could be expanded to include supervisors and incumbents in the
respective areas who would formally rate the knowledges and skills
identified in terms of their importance, as is recommended by Primoff
(1975).

Whether the additional information gained from the use of these
approaches warrants the large expenditure of resources remains to be
seen. However, additional research in this area is necessary to determine
the effectiveness of current job analysis approaches employed within the
realm of teacher certification, and to identify superior approaches to job
analysis in this setting.



Validity

One of the primary concerns in the teacher certification measurement
effort is validity. Validity refers to the ability of a measuring
instrument to do what it is intended to do (Nunnally, 1978), or, more
specifically, "the degree to which inferences from scores on tests or
assessments are justified or supported by evidence" (APA Principles, 1980,
p. 2). Traditionally, and in licensing, three aspects of validity are
discussed: criterion-related validity (predictive and concurrent), content
validity, and construct validity (APA Standards, 1974). Criterion-related
validity is of concern when one wishes to infer, from a given instrument, an
individual's performance on some other variable, referred to as the
criterion (APA Standards, 1974; Nunnally, 1978). Content validity is of
importance when one wishes to estimate "how an individual performs in the
universe of situations the test is intended to represent" (APA Standards,
1974). The third aspect of validity, construct validity, concerns the extent
to which a measurement tool is related to the various elements or underlying
traits associated with the psychological construct it is purported to
measure.

Validity is of particular concern in the development and use of
personnel screening instruments, where one wishes to establish that a test
does indeed truly measure the important aspects of job performance it is
purported to measure. It is imperative that a relationship between
teacher certification decisions based on a measurement instrument and
aspects of the job required for successful performance in the classroom be
established. Most of the validation attempts for teacher certification
tests have focused on content validity. The key concern within teacher
certification testing has been to ensure that the tests developed reflect
the significant aspects of the teaching profession for which they are
designed. At a minimum, the content of certification instruments should
be drawn from important elements of the teaching job.

A discussion of the validation of teacher certification assessment
measures is provided below. As most of the validation work on teacher
certification tests has concerned content validation, the discussion
focuses on content validity.

Content Validity

The content validity of a test is established by demonstrating that
the content included within the instrument represents a sample of the
content or behavior included in the performance domain. Content validation,
as applied to teacher certification, generally has two components: (1)
determining whether the test content reflects significant aspects of the
educator's job (and measures those aspects proportionally), and (2)
determining whether test items developed accurately measure that job's
content. The first component is often assessed through some form of job
analysis and is discussed at length in earlier sections of this paper.
Discussion of the second area, item validation, is presented in the
following sections.

Content Validation Approaches. A variety of approaches to assessing
item validity are available to the practitioner. Among the methods
available to be considered here are (a) index of item-objective congruence,
(b) rating scale approach, and (c) dichotomous judgment model. Within
each of the above approaches, a panel of judges evaluates examination
items on an item-by-item basis to determine if the item is a valid measure
of the domain (objectives, item specification, topic) for which it was
written.

Within the item-objective congruence model, content experts are asked
to assign ratings of +1 (item measures the objective), 0 (undecided
whether item measures the objective), and -1 (item does not measure the
objective) to each item. Judges are asked to rate each item against each
objective. An index of item-objective congruence (ranging from -1 to +1),
developed by Rovinelli and Hambleton (1977), can then be computed for each
item, and a cutoff score for identifying items as valid or invalid can be
established.
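A minimal sketch of how such an index can be computed follows. It uses a commonly cited form of the index, in which the value equals half the difference between the mean rating on the intended objective and the mean rating on all other objectives. The exact published formula should be checked against Rovinelli and Hambleton (1977). The function name and data layout are illustrative.

```python
def congruence_index(ratings, target):
    """Item-objective congruence for one item.

    ratings[j][i] is judge j's rating (+1, 0, or -1) of the item against
    objective i; target is the index of the objective the item was written
    for. Returns a value between -1 and +1.
    """
    n_judges = len(ratings)
    n_objectives = len(ratings[0])
    target_sum = sum(row[target] for row in ratings)
    total_sum = sum(sum(row) for row in ratings)
    return (n_objectives * target_sum - total_sum) / (
        2 * (n_objectives - 1) * n_judges
    )
```

An item rated +1 on its own objective and -1 on every other objective by all judges scores +1, and the reverse pattern scores -1.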

The rating scale approach (Hambleton, 1980) involves expert judges
assessing each item as a measure of its intended objective on a rating
scale. The mean or median score across judges is computed and a cutoff
score for accepting items as valid is set. The index of item-objective
congruence and rating scale procedures are described in more depth in
Hambleton (1980).

A third approach available is the dichotomous judgment model
(Nassif, 1978). Here a panel of content experts indicate, for each item,
whether they feel the item is or is not a valid measure of the objective
for which it was written. Item validity is defined as having four parts:
accuracy, congruence with objective, significance, and lack of bias. The
results from the content expert evaluations for each item are compared to
the binomial distribution to determine the probability, due to chance
alone, of obtaining "x" valid responses for an item from a total of "N"
raters. Items receiving ratings meeting statistical significance are
treated as valid.
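The binomial comparison can be illustrated as follows. The text does not state the chance probability of a "valid" response or whether the probability is for exactly "x" or for "x" or more responses. The sketch assumes a chance probability of .5 and an upper-tail probability, and the function names and default significance level are illustrative.

```python
from math import comb


def chance_probability(x, n, p=0.5):
    """Probability of x or more 'valid' responses from n raters if every
    rater responded at random with probability p of saying 'valid'."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def min_valid_votes(n, alpha=0.10, p=0.5):
    """Smallest number of 'valid' responses whose chance probability falls
    below alpha, or None if no count qualifies."""
    for x in range(n + 1):
        if chance_probability(x, n, p) < alpha:
            return x
    return None
```

Under these assumptions, a panel of 15 raters and a significance level of .10 would require at least 11 "valid" responses, since the chance probability of 11 or more is about .059 while that of 10 or more is about .151.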

Content Validation Applications. Content validation procedures have
been employed in a variety of teacher certification testing efforts. The
dichotomous judgment model has been widely used in the validation of
content knowledge examinations in a number of states. This approach has
been used in the development of certification tests for educators in
Georgia, where panels of approximately 15 content experts in each field
for which examinations were developed were asked to make dichotomous
judgments about prospective items to be included on the test. For each
item where the probability of obtaining "x" valid responses from "N"
raters due to chance alone was less than .10, the item was categorized as
valid. Similar procedures were employed in the development of content
examinations for teacher certification in Alabama and Oklahoma. The
dichotomous judgment model was also applied in the development of the
English Language Proficiency Examination to be administered to individuals
seeking admission to teacher education programs in Alabama.

The content validation of the teacher certification tests developed
for Florida involved the review of test items by two independent panels of
experts in the four subtest areas. The panels reviewed the items based on
supplied criteria (e.g., item-competency match, bias) and recommended
acceptance, rejection, or revision of each item.
Post hoc efforts to establish the validity of the National Teacher's
Examination (NTE) have been undertaken in a number of states. In South
Carolina, a validation study was undertaken to, among other reasons,
establish the content validity of the NTE. Panels of content experts in
the various NTE teaching areas were asked to judge whether the individual
questions appearing on the examination were covered in the curriculum of
South Carolina teacher education programs. This is akin to the
dichotomous judgment model presented earlier; however, the criterion for
accepting items as valid was that 51% or more of the judges cited the item
as congruent with the curriculum, rather than a comparison to the binomial
distribution. Similar validation efforts for the NTE are planned or are
underway in Arkansas, Kentucky, Virginia, and Tennessee.

Item validation for the South Carolina Teaching Area Examinations,
administered to teachers exiting teacher education programs for purposes
of initial certification, relied on the rating scale approach. Panels of
South Carolina educators rated each item developed on a scale from 1 to 5
ranging from "clearly valid" to "clearly not valid." Items receiving mean
ratings below 3.0 across judges were treated as valid.
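The decision rule just described can be sketched in a few lines. The function name is illustrative, and the option of using the median instead of the mean reflects the general rating scale approach, not the South Carolina procedure specifically.

```python
from statistics import mean, median


def is_valid_item(ratings, cutoff=3.0, summary=mean):
    """Treat an item as valid when the summary of the judges' ratings
    (1 = clearly valid ... 5 = clearly not valid) falls below the cutoff."""
    return summary(ratings) < cutoff
```

Because low ratings indicate validity on this scale, an item with a mean rating exactly at the cutoff is not accepted under a strict "below 3.0" rule.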

There has been little attempt to apply item-objective congruence
models in teacher certification to date. The primary reason for their
absence from the teacher certification area stems from issues of
feasibility. The approach is quite time-consuming and potentially quite
costly to the consumer. For example, if there are 50 objectives and 100
test items, each judge must make 5,000 judgments.






The rating scale approach and dichotomous judgment model offer
practical advantages; they are relatively simple to administer, and the
analysis associated with them is fairly straightforward. The dichotomous
judgment model, when used in conjunction with the binomial distribution,
offers the added advantage of guarding against items being judged valid
on the basis of chance alone.

While content validity is clearly an important element in the
development of teacher certification tests, a number of measurement
specialists have emphasized that content validity is an insufficient
criterion for establishing the validity of a test. Messick (1975) and, more
recently, Hambleton (1980) note that content validity does not provide
evidence regarding the uses of or inferences made from test scores. Despite
the importance assigned to criterion-related validity and construct
validity, few validation studies in this area have been conducted in the
teacher certification field. These issues are discussed in greater depth in
the following sections.

Criterion-Related Validity 


Criterion-related validity "compares test scores, or predictions made
from them, with an external variable (criterion) considered to provide a
direct measure of the characteristic or behavior in question" (Cronbach,
1971, p. 444). Criterion-related validity, as applied to teacher
certification, examines the relationship between an instrument administered
for certification purposes and actual teacher performance on the job. A
teacher certification test should accurately predict that aspect of
teacher competency for which it was designed.

Two forms of criterion-related validation are generally discussed:
(1) concurrent validity, and (2) predictive validity (APA Standards,
1974). "Statements of concurrent validity indicate the extent to which
the test may be used to estimate an individual's present standing on the
criterion," whereas predictive validity refers to "the extent to which an
individual's future level on a criterion can be predicted from a knowledge
of prior test performance" (APA Standards, 1974, p. 26). Concurrent
validation, as applied to teacher certification testing, examines the
relationship between the test scores of practicing educators (job
incumbents) and current performance. Establishing the predictive validity
of a teacher certification measure involves the examination of the
relationship between the test scores of prospective teachers (job
applicants) and future performance. Both forms of criterion-related validity
are concerned with the accuracy of the measures in predicting teacher
competency.
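The relationship between test scores and a criterion is conventionally summarized as a correlation, often called a validity coefficient. The text does not name a particular statistic, so the sketch below assumes the Pearson product-moment correlation. The function name and the idea of a numeric performance rating as the criterion are illustrative.

```python
from math import sqrt


def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between certification test scores and a numeric
    criterion measure (e.g., a later job performance rating)."""
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(test_scores, criterion_scores)
    )
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    syy = sum((y - mean_y) ** 2 for y in criterion_scores)
    return sxy / sqrt(sxx * syy)
```

In a concurrent design both lists would come from practicing educators at about the same time. In a predictive design the criterion would be collected some time after the applicants were tested.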

While criterion-related validity has been held as a necessary part of
validating certification tests, a number of obstacles have prevented the
execution of criterion-related validation studies for teacher
certification measures. Hecht (1976), while supporting the importance of
criterion-related validation for licensing and certification tests, notes
that criterion-related validation studies are "difficult to develop,
time-consuming, impractical for numerous reasons, and expensive" (p. 8).
Nassif, Gorth, and Rubinstein (1977) provide a more in-depth treatment of
these issues as they relate specifically to teacher certification
testing. Nassif et al. (1977) suggest that in order to demonstrate the
predictive validity of teacher certification tests, the following criteria
are required:

(1) admission of all applicants for employment in the field;

(2) sufficient time lapse before observing the criterion variable;

(3) unexamined, unused results of the test, i.e., the predictor
    stored until correlated with the criterion (here, retention or
    dismissal of teacher due to subject-matter competence/
    incompetence);

(4) the criterion must be measurable, i.e., a mechanism for
    accurately and reliably collecting the reasons for retention or
    dismissal of teachers (criterion) which clearly separates
    content knowledge as one of those reasons;

(5) sufficient sample size; and

(6) stability of the criterion.

However, these factors are usually not present in a certification
program. Problems associated with conducting a criterion-related
validation study for teacher certification tests are discussed at length by
Nassif et al. (1977).
Construct validity is aimed at answering the question "Does the test
measure the attribute it is said to measure?" (Cronbach, 1971). Construct
validation is a process (rather than a single study) whereby evidence
relating test scores to the attributes of the construct the test is
purported to measure is accumulated. Cronbach (1971) notes that when
statements are made that test scores reflect levels of a certain skill or
knowledge, one is "constructing" an interpretation of these scores, and
construct validation is a necessity. While the constructs underlying
teacher certification measures are somewhat simpler than a complex and
abstract personality construct such as "aggressiveness," there still exists
an underlying construct (i.e., "pedagogical skill" or "content knowledge").

The construct validation of tests designed for teacher certification
presents a number of problems, and, as such, there has been little effort
to construct validate existing teacher certification measures. Potential
approaches to, and problems inherent in, conducting construct validation
studies in this area are discussed below.

One of the primary methods for establishing the construct validity of
a given measure is to establish a relationship between that measure and
other measures of the same construct. For content knowledge tests used
for teacher certification purposes, this would require comparison of the
tests with other assessments of applicants' content knowledge. Similarly,
performance or pedagogical skill certification tests would be compared
with alternative performance or pedagogical skill assessments. Attempts
to construct validate teacher certification tests using alternative
measures of the construct suffer from many of the problems noted earlier
in our discussion of criterion-related validity, notably the location of a
suitable criterion measure and the stability of that criterion. A
"well-matched" criterion measure adequately measuring the construct
reflected in the test to be validated is often unavailable. Moreover,
instructor or supervisor assessments of a candidate's proficiency are
unsuitable as criterion measures for construct validation because of the
unreliability and questionable accuracy of such criteria.

While it is difficult to obtain suitable criterion measures for use
in the construct validation of teacher certification tests, Hambleton
(1980) notes that construct validation should also be aimed at examining
possible sources of error that reduce the validity of test scores. Among
the factors suggested for consideration by Hambleton (1980) that are
applicable to teacher certification are the effects of test administration
procedures, examinee test-taking skills, and examinee motivation. Although
little attempt has been made to investigate the impact of these factors on
teacher certification, future validation efforts in this area should
include the consideration of these factors. Another approach to construct
validation, suggested by Hambleton (1980), involves the use of factor
analysis to verify the domain structure of the test. One would expect the
factor structure of the test to correspond to the domain structure of the
test design, with individual test items loading on a single factor
corresponding to the appropriate domain. This approach has been employed in
the development of the teacher performance assessment instruments in
Georgia and South Carolina.

Reliability

Reliability concerns the extent to which a measure consistently
produces the same result under similar conditions (Nunnally, 1978). As
with any measurement effort, the reliability of the assessment instruments
used is a key concern in teacher certification testing. Traditionally,
reliability has been thought of in terms of the internal consistency of a
test or the stability of test scores across repeated administrations and
parallel forms of the test. More recently, particularly in the area of
certification, test developers have begun to examine reliability in terms
of the dependability of classification decisions (e.g., pass/fail).
Traditional and more recent approaches to teacher certification test
reliability, and their current applications, are considered below.

Approaches 

A number of methods for determining the reliability of teacher
certification tests are available. Traditionally, three approaches to
reliability have been employed: (1) stability, (2) equivalence, and (3)
internal consistency. Stability refers to the consistency of the
measurement over time, while equivalence estimates are obtained to
determine the consistency of measurement across two or more forms of the
test. The internal consistency of a test refers to the consistency of items
included within a single test form. By far the most common approach is
internal consistency estimation, because it requires only one test form and
the estimates are easy to obtain.

The most common approach to assessing the stability of a test over
time is the test-retest method, where the same test is administered to a
single group of individuals at two different points in time. The
correlation between the scores at T1 and T2 is obtained as an estimate
of the test's reliability (Nunnally, 1978). Similarly, the reliability of
two alternative forms (equivalence) can be determined by administering two
forms of a test to a pool of examinees and computing the correlation
between the two sets of scores as an estimate of test reliability
(Nunnally, 1978). However, this approach has little application in
teacher certification testing, as only a single test form is employed in
most certification programs.
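
The test-retest and alternative-forms estimates described above both reduce
to a correlation between two sets of scores from the same examinees. The
following is a minimal sketch, assuming the ordinary Pearson product-moment
correlation is the coefficient intended; the function name `pearson_r` and
the six pairs of scores are hypothetical and serve only as illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for six examinees at two points in time (T1, T2).
time1 = [52, 61, 47, 70, 58, 66]
time2 = [55, 59, 50, 72, 57, 68]

# Test-retest estimate; roughly 0.97 for these made-up scores.
test_retest = pearson_r(time1, time2)
```

For an equivalence estimate the same function would be applied to scores on
two forms of the test rather than to two administrations of one form.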

Internal consistency approaches estimate test reliability using a
single test form. Two internal consistency approaches are generally
employed: split-half reliability and the Kuder-Richardson indices of item
homogeneity (K-R20, K-R21; Nunnally, 1978). The former approach involves
splitting a test into two halves and correlating scores on the two sets of
items as an estimate of internal reliability. The latter approach
examines the average of all possible split-half reliability coefficients.
While both internal consistency approaches are used, the Kuder-Richardson
formulas are considered more accurate and hence are employed with greater
frequency.
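
The Kuder-Richardson indices named above can be sketched in a few lines.
This is an illustration under stated assumptions, not a procedure taken
from the text: items are scored 0/1, the standard textbook forms of K-R20
and K-R21 are used, and the total-score variance is computed with N (not
N - 1) in the denominator. The function names and the small response matrix
are hypothetical.

```python
def _total_score_stats(responses):
    """Return (n_items, mean, population variance) of examinee total scores."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if k < 2 or var_total == 0:
        raise ValueError("need at least 2 items and some score variance")
    return k, mean_total, var_total


def kr20(responses):
    """K-R20: uses each item's proportion correct (p) and 1 - p."""
    k, _, var_total = _total_score_stats(responses)
    n = len(responses)
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def kr21(responses):
    """K-R21: shortcut that treats all items as equally difficult."""
    k, mean_total, var_total = _total_score_stats(responses)
    return (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * var_total))


# Hypothetical 0/1 responses: 5 examinees (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
r20 = kr20(responses)  # about 0.80 for this matrix
r21 = kr21(responses)  # about 0.67 for this matrix
```

K-R21 needs only the mean and variance of total scores, which makes it easy
to compute, but it never exceeds K-R20 and understates it when item
difficulties differ, as they do in this example.
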

More recently, a number of writers (cf. Huynh, 1976) have suggested
that the reliability of tests, in situations where a dichotomous decision
is made on the basis of test scores, should be assessed on the basis of
the consistency of the decisions across test administrations. It has been
suggested that this is particularly applicable in criterion-referenced
testing, where the problem of restricted range of test scores may be
present. These approaches would seem particularly applicable to the
teacher certification testing area, where dichotomous master/nonmaster
decisions are made.

While a number of decision-consistency approaches have emerged in
recent years, only a sample of the more visible approaches applicable to
teacher certification is presented here: Kappa reliability (Swaminathan,
Hambleton, and Algina, 1974; Huynh, 1976; Subkoviak, 1980) and
generalizability analysis (Brennan, 1980).

The Kappa reliability approach examines the consistency of
classification decisions across test administrations. The extent of actual
agreement across test administrations (computed by calculating the
proportion of examinees consistently classified in a given mastery state
on two administrations) is compared to the extent of agreement that could
be expected by chance alone. These two facets are used to calculate a
coefficient of decision-consistency. Specific procedures for computing
Kappa are described in Swaminathan et al. (1974). Procedures for obtaining
Kappa reliability estimates from a single test administration are discussed
in Huynh (1976) and Subkoviak (1980).

The assessment of reliability using generalizability theory employs
estimates of the variance components attributable to the various elements
in the assessment situation (e.g., items, persons). Reliability, then, is
viewed as a function of the proportion of variance accounted for by the
person component.
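
The two approaches just described can be sketched as follows. Both
functions are illustrations under stated assumptions rather than the
procedures of the cited authors. The first computes Kappa from two actual
administrations, assuming a hypothetical cut score `cut` at or above which
an examinee is classified as a master. The second assumes the simplest
generalizability design, persons crossed with items, and estimates variance
components from the usual two-way analysis of variance mean squares; the
names `kappa_decision_consistency` and `g_coefficient` and all data are
made up.

```python
def kappa_decision_consistency(scores_a, scores_b, cut):
    """Return (observed agreement, chance agreement, kappa) for
    master/nonmaster decisions on two administrations."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    n = len(scores_a)
    pass_a = [s >= cut for s in scores_a]  # assumed rule: pass at or above cut
    pass_b = [s >= cut for s in scores_b]
    p_agree = sum(a == b for a, b in zip(pass_a, pass_b)) / n
    rate_a, rate_b = sum(pass_a) / n, sum(pass_b) / n
    # Agreement expected by chance from the two marginal pass rates.
    p_chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_chance == 1:
        raise ValueError("kappa is undefined when all decisions are identical")
    return p_agree, p_chance, (p_agree - p_chance) / (1 - p_chance)


def g_coefficient(responses):
    """Person variance as a share of person-plus-error variance for a
    persons (rows) by items (columns) score matrix."""
    n_p, n_i = len(responses), len(responses[0])
    grand = sum(map(sum, responses)) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in responses]
    item_means = [sum(row[j] for row in responses) / n_p for j in range(n_i)]
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in responses for x in row)
    ms_p = ss_p / (n_p - 1)
    ms_res = (ss_total - ss_p - ss_i) / ((n_p - 1) * (n_i - 1))
    var_p = max(0.0, (ms_p - ms_res) / n_i)  # person variance component
    var_err = ms_res / n_i                   # relative error, n_i-item test
    if var_p + var_err == 0:
        raise ValueError("coefficient is undefined without score variance")
    return var_p / (var_p + var_err)


# Hypothetical scores for ten examinees on two administrations, cut = 60.
first = [72, 65, 58, 80, 45, 61, 59, 90, 55, 68]
second = [70, 59, 62, 78, 50, 64, 52, 88, 57, 71]
p_agree, p_chance, kappa = kappa_decision_consistency(first, second, 60)

# Hypothetical 0/1 item scores: 5 examinees by 4 items.
items = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
g = g_coefficient(items)
```

In the example, eight of ten examinees receive the same decision both
times (observed agreement 0.80) against a chance agreement of 0.52, so
Kappa is about 0.58. The generalizability coefficient for the small item
matrix is about 0.80.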

Applications

While there are a considerable number of approaches to examining test
reliability, the diversity of applications of test reliability methods to
teacher certification testing has been somewhat limited.

The traditional approaches to reliability have been extensively
applied in teacher certification, particularly internal consistency
approaches. For virtually all teacher certification measurement efforts,
internal reliability estimates have been obtained. K-R20 reliability
coefficients are routinely obtained for teacher certification tests
administered in Georgia, Alabama, Oklahoma, and other statewide
certification programs, as well as for the NTE. This is not surprising, as
these estimates are reasonably easy to obtain and provide a reasonable
assessment of test reliability.

With increased criticism of more traditional reliability approaches,
test developers in the area of teacher certification have begun to employ
decision-consistency models and generalizability analysis with increased
frequency. The reliability of the teacher performance assessment
instrument in Georgia was recently examined using generalizability
analysis. Both decision-consistency (Subkoviak, 1980) and generalizability
analysis were applied in the assessment of the reliability of the Georgia
teaching field examinations.